Consistency of Distance-based Criterion for Selection of Variables in High-dimensional Two-Group Discriminant Analysis

Tetsuro Sakurai and Yasunori Fujikoshi

Center of General Education, Tokyo University of Science, Suwa, 5000-1 Toyohira, Chino, Nagano 391-0292, Japan

Department of Mathematics, Graduate School of Science, Hiroshima University, 1-3-1 Kagamiyama, Higashi Hiroshima, Hiroshima 739-8626, Japan

Abstract

This paper is concerned with the selection of variables in two-group discriminant analysis with a common covariance matrix. We propose a distance-based criterion (DC) drawing on the contribution of each variable to the distance between the two groups. The selection method can be applied to high-dimensional data. Sufficient conditions for the distance-based criterion to be consistent are provided when the dimension and the sample size are large. Our results, and tendencies therein, are explored numerically through a Monte Carlo simulation.

AMS 2000 subject classification: primary 62H12; secondary 62H30

Key Words and Phrases: Consistency, Discriminant analysis, High-dimensional framework, Selection of variables, Distance-based criterion.

1. Introduction

This paper is concerned with the variable selection problem in two-group discriminant analysis with a common covariance matrix. Several methods have been proposed for high-dimensional data as well as for large-sample data. One class of methods is based on model selection criteria such as AIC and BIC; see, e.g., Fujikoshi (1984) and Nishii et al. (1988). There are also methods based on misclassification errors, by McLachlan (1976), Fujikoshi (1985), Hyodo and Kubokawa (2014), and Yamada et al. (2017). For high-dimensional data, there are the Lasso and other regularization methods, by Clemmensen et al. (2011), Witten and Tibshirani (2011), etc.

In this paper we propose a distance-based criterion based on the contribution of each variable to the distance between the two groups, which is useful for high-dimensional as well as large-sample data. Here, the contribution of each variable is measured after removing the effects of the other variables. The criterion involves a constant term which should be determined through some optimality consideration. We propose a class of constants for which the criterion is consistent when the dimension and the sample size are large. The class depends on unknown population parameters. With a view toward actual use, we also propose a class of estimators for the constant term that preserves the consistency. Our results, and tendencies therein, are explored numerically through a Monte Carlo simulation.

The remainder of the present paper is organized as follows. In Section 2, we present the relevant notation and the distance-based criterion. In Section 3 we derive sufficient conditions for the distance-based criterion to be consistent in a high-dimensional framework. In Section 4 we propose a class of estimators for the constant term. The proposed distance-based criterion is examined through a Monte Carlo simulation in Section 5. In Section 6, conclusions are offered. All proofs of our results are provided in the Appendix.

2. Distance-based Criterion

In two-group discriminant analysis, suppose that we have independent samples y_1^{(i)}, . . . , y_{n_i}^{(i)} from p-dimensional normal distributions N_p(μ^{(i)}, Σ), i = 1, 2. Let Y be the total sample matrix defined by

Y = (y_1^{(1)}, . . . , y_{n_1}^{(1)}, y_1^{(2)}, . . . , y_{n_2}^{(2)})'.

Let Δ and D be the population and the sample Mahalanobis distances defined by

Δ = {(μ^{(1)} − μ^{(2)})' Σ^{−1} (μ^{(1)} − μ^{(2)})}^{1/2},

and

D = {(x̄^{(1)} − x̄^{(2)})' S^{−1} (x̄^{(1)} − x̄^{(2)})}^{1/2},

respectively. Here x̄^{(1)} and x̄^{(2)} are the sample mean vectors, and S is the pooled sample covariance matrix based on n = n_1 + n_2 samples. Let β be the coefficient vector of the population discriminant function, given by

β = Σ^{−1}(μ^{(1)} − μ^{(2)}) = (β_1, . . . , β_p)'.   (2.1)

Suppose that j denotes a subset of ω = {1, . . . , p} containing p_j elements, and y_j denotes the p_j-vector consisting of the elements of y indexed by the elements of j. We use the notation D_j and D_ω for D based on y_j and y_ω (= y), respectively.

One approach to variable selection is to consider a set of models as follows. For each subset j of ω = {1, . . . , p}, consider a variable selection model M_j defined by

M_j : β_i ≠ 0 if i ∈ j, and β_i = 0 if i ∉ j.   (2.2)

We identify the selection of M_j with the selection of y_j. Let AIC_j be the AIC for M_j. Then it is known (see, e.g., Fujikoshi 1985) that

A_j = AIC_j − AIC_ω   (2.3)
    = n log{1 + g^2(D_ω^2 − D_j^2)/(n − 2 + g^2 D_j^2)} − 2(p − p_j),

where g = (n_1 n_2/n)^{1/2}.

Similarly, let BIC_j be the BIC for M_j; then

B_j = BIC_j − BIC_ω   (2.4)
    = n log{1 + g^2(D_ω^2 − D_j^2)/(n − 2 + g^2 D_j^2)} − (log n)(p − p_j).

In a large-sample framework, it is known (see Fujikoshi 1985; Nishii et al. 1988) that AIC is not consistent, but BIC is consistent. The variable selection methods based on AIC and BIC are given by min_j AIC_j and min_j BIC_j, respectively. Such model selection methods are computationally onerous when p is large, since they involve minimizing over all 2^p subsets.

To circumvent this issue, we consider a distance-based criterion (DC) drawing on the contribution of each variable to the distance. The contribution of y_i to the distance between the two groups may be measured by

D^2 − D^2_{(−i)},   (2.5)

where (−i) denotes the subset of ω = {1, . . . , p} obtained by omitting i from ω. If D^2 − D^2_{(−i)} is large, y_i may be considered to contribute substantially to the discrimination; conversely, if D^2 − D^2_{(−i)} is small, y_i may be considered to contribute little. Let us define

D_{d,i} = D^2 − D^2_{(−i)} − d, i = 1, . . . , p,   (2.6)

where d is a positive constant. We propose a distance-based criterion for the selection of variables defined by selecting the set of suffixes, or the set of variables, given by

DC_d = {i ∈ ω | D_{d,i} > 0},   (2.7)

or {y_i ∈ {y_1, . . . , y_p} | D_{d,i} > 0}. The notation ĵ_{DC_d} is also used for DC_d.

For a distance-based criterion, an important problem is how to decide the constant term d. In Section 3 we derive a range of d for which the criterion has high-dimensional consistency. The range depends on the population parameters. For practical use, we also propose a class of estimators for d, given by

d̂ = {(n − 2 + g^2 D^2)/((n − p − 1) g^2)} {a(np)^{1/3} + (n − p − 1)/(n − p − 3)},   (2.8)

where a is a constant satisfying a = O(1). It is shown that DC_d̂ is consistent under a high-dimensional framework.
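To make the selection rule concrete, here is a minimal sketch of (2.5)-(2.7) in Python (our own illustration, not code from the paper): it computes D^2 from the pooled covariance matrix, recomputes D^2_{(−i)} with each variable omitted, and returns the index set DC_d. The function name dc_select and the brute-force refits are assumptions of this sketch.

```python
import numpy as np

def dc_select(x1, x2, d):
    """Distance-based criterion DC_d of (2.7): select {i : D^2 - D^2_(-i) - d > 0}.

    x1, x2 : (n1, p) and (n2, p) data matrices for the two groups.
    d      : positive constant term of (2.6).
    """
    n1, p = x1.shape
    n2 = x2.shape[0]
    n = n1 + n2
    diff = x1.mean(axis=0) - x2.mean(axis=0)              # xbar^(1) - xbar^(2)
    # Pooled sample covariance matrix S based on n = n1 + n2 samples.
    s = ((n1 - 1) * np.cov(x1, rowvar=False)
         + (n2 - 1) * np.cov(x2, rowvar=False)) / (n - 2)
    d2_full = diff @ np.linalg.solve(s, diff)             # D^2 on all p variables
    selected = []
    for i in range(p):
        keep = np.delete(np.arange(p), i)                 # omit variable i
        d2_omit = diff[keep] @ np.linalg.solve(s[np.ix_(keep, keep)], diff[keep])
        if d2_full - d2_omit - d > 0:                     # D_{d,i} > 0
            selected.append(i)
    return selected
```

Note that only the p leave-one-variable-out distances are needed, rather than refits over all 2^p subsets.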

3. High-dimensional Consistency

For studying consistency properties of the variable selection criterion DC_d, it is assumed that the true model M∗ is specified through the values of μ^{(1)}, μ^{(2)} and Σ. For simplicity, we write the variable selection models in terms of β, i.e., M_j. Assume that the true model M∗ is included in the full model M_ω. Let the minimum model including M∗ be M_{j∗}. For notational simplicity, we regard the true model M∗ as M_{j∗}. Let F be the entire collection of candidate models, defined by

F = {{1}, . . . , {p}, {1, 2}, . . . , {1, . . . , p}}.

We separate F into two sets: the set of overspecified models, F_+ = {j ∈ F | j ⊇ j∗}, and the set of underspecified models, F_− = F_+^c ∩ F.

Here we list our main assumptions:

A1 (The true model): M_{j∗} ∈ F.
A2 (The high-dimensional asymptotic framework): p → ∞, n → ∞, p/n → c ∈ (0, 1).
A3: p∗ (= p_{j∗}) is finite, and Δ^2 = O(1).
A4: n_i/n → k_i > 0, i = 1, 2.

Relating to our proof of consistency of DC_d in (2.7), note that

DC_d = j∗ ⟺ D_{d,i} > 0 for i ∈ j∗, and D_{d,i} ≤ 0 for i ∉ j∗.

Therefore,

P(DC_d = j∗) = P(∩_{i∈j∗} {D_{d,i} > 0} ∩ ∩_{i∉j∗} {D_{d,i} ≤ 0})
            = 1 − P(∪_{i∈j∗} {D_{d,i} ≤ 0} ∪ ∪_{i∉j∗} {D_{d,i} > 0})
            ≥ 1 − Σ_{i∈j∗} P(D_{d,i} ≤ 0) − Σ_{i∉j∗} P(D_{d,i} ≥ 0).

Our result will be obtained by proving the following:

[F1] Σ_{i∈j∗} P(D_{d,i} ≤ 0) → 0,   (3.1)
[F2] Σ_{i∉j∗} P(D_{d,i} ≥ 0) → 0.   (3.2)

Here [F1] controls the probability that DC_d fails to select some true variable, and [F2] controls the probability that DC_d selects some redundant variable. Our consistency result depends on

τ_min = min_{i∈j∗} τ_i,   (3.3)

where τ_i = Δ^2 − Δ^2_{(−i)}, i ∈ ω.

Theorem 3.1. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Let b = τ_min/(1 − c), where τ_min is defined by (3.3). Then, if 0 < d < b, the distance-based criterion DC_d satisfies [F1], i.e.,

Σ_{i∈j∗} P(D_{d,i} ≤ 0) → 0.   (3.4)

Related to a sufficient condition for (3.2), we use the following result: for i ∉ j∗,

E{D^2 − D^2_{(−i)}} = {(n − 2)/((n − p − 3)(n − p − 2))} {g^{−2}(n − 3) + Δ^2} ≡ q(p, n, Δ^2).   (3.5)

Theorem 3.2. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, if d = O(1) and d > q(p, n, Δ^2), the distance-based criterion DC_d satisfies

Σ_{i∉j∗} P(D_{d,i} ≥ 0) → 0.   (3.6)

Combining Theorems 3.1 and 3.2, a sufficient condition for DC_d to be consistent is given as follows.

Theorem 3.3. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then, if d = O(1) and

q(p, n, Δ^2) < d < b = τ_min/(1 − c),   (3.7)

the distance-based criterion DC_d is consistent.

For actual use, it is necessary to obtain an estimator d̂ of d. This problem is discussed in Section 4.
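As a worked illustration of the range (3.7), the following sketch evaluates both endpoints for a concrete configuration in which Σ = I_p and the mean difference is 2α on the first p∗ coordinates (the setting used in Section 5), so that Δ^2 = 4α^2 p∗ and τ_i = 4α^2 for every true variable. The helper name d_interval and the use of the finite-sample ratio p/n in place of c are our own assumptions.

```python
def d_interval(n1, n2, p, p_star, alpha):
    """Endpoints (q, b) of the consistency range (3.7), assuming Sigma = I_p
    and a mean difference of 2*alpha on the first p_star coordinates."""
    n = n1 + n2
    g2 = n1 * n2 / n                     # g^2 = n1 n2 / n
    delta2 = 4 * alpha**2 * p_star       # Delta^2 = (2 alpha)^2 p_star
    tau_min = 4 * alpha**2               # tau_i = Delta^2 - Delta^2_(-i), i true
    q = (n - 2) * ((n - 3) / g2 + delta2) / ((n - p - 3) * (n - p - 2))
    b = tau_min / (1 - p / n)            # c replaced by the ratio p/n
    return q, b

# The midpoint d0 = (q + b) / 2 of this interval is the choice examined in Section 5.
```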

4. A Method of Determining the Constant Term

In the previous section we saw that the distance-based criterion DC_d is consistent if we use a constant term d such that d = O(1) and q(p, n, Δ^2) < d < τ_min/(1 − c), where q(p, n, Δ^2) is given by (3.5), τ_min = min_{i∈j∗} τ_i, and τ_i = Δ^2 − Δ^2_{(−i)}. However, such a d involves unknown parameters. We need to estimate d by using estimators of Δ^2 and τ_min. Here, we construct an estimator by using the fact that DC_d is related to the test of τ_i = 0, or equivalently β_i = 0. The likelihood ratio test is based on (see, e.g., Rao 1973; Fujikoshi et al. 2010)

g^2 D^2_{{i}·(−i)} / (n − 2 + g^2 D^2_{(−i)}),

whose null distribution is that of a ratio χ^2_1/χ^2_{n−p−1} of independent chi-squared variates χ^2_1 and χ^2_{n−p−1}, where D^2_{{i}·(−i)} = D^2 − D^2_{(−i)}. Let us define d̂ as follows:

d̂ = {(n − 2 + g^2 D^2)/((n − p − 1) g^2)} {a(np)^{1/3} + (n − p − 1)/(n − p − 3)}.   (4.1)

Here a is a constant satisfying a = O(1). Then it can be shown that DC_d̂ is consistent.

Theorem 4.1. Suppose that assumptions A1, A2, A3 and A4 are satisfied. Then the distance-based criterion DC_d̂ with d̂ in (4.1) is consistent.

As special cases of d̂, we consider the following:

d̂_1 = {(n − 2 + g^2 D^2)/((n − p − 1) g^2)} {(np)^{1/3} + (n − p − 1)/(n − p − 3)},   (4.2)
d̂_2 = {(n − 2 + g^2 D^2)/((n − p − 1) g^2)} {(1 − p/n)^2 (np)^{1/3} + (n − p − 1)/(n − p − 3)}.   (4.3)

The estimators d̂_1 and d̂_2 are obtained from d̂ by choosing a = 1 and a = (1 − p/n)^2, respectively.

In Theorem 3.3, we saw that DC_d is consistent under a high-dimensional framework if d satisfies q(p, n, Δ^2) < d < τ_min/(1 − c). The expectation of d̂ is related to this range as follows:

(1) E(d̂) > q(p, n, Δ^2); (2) lim_{p/n→c} E(d̂) = 0 < τ_min/(1 − c).

The above result follows from

E[d̂] = {a(np)^{1/3} + (n − p − 1)/(n − p − 3)} {1/((n − p − 1) g^2)} {n − 2 + (n − 2)(p + g^2 Δ^2)/(n − p − 3)}
     = a(np)^{1/3} {(n − 2)(n − 3)/(g^2 (n − p − 3)(n − p − 1)) + (n − 2) Δ^2/((n − p − 3)(n − p − 1))}
       + (n − 2)(n − 3)/(g^2 (n − p − 3)^2) + (n − 2) Δ^2/(n − p − 3)^2.
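The estimators (4.1)-(4.3) are simple plug-in quantities. A minimal sketch (assuming D^2 has already been computed, e.g. as d2_full in the earlier dc_select sketch; the function name d_hat is our own):

```python
def d_hat(d2, n1, n2, p, a=1.0):
    """Estimated constant term d-hat of (4.1).

    a = 1 gives d-hat_1 of (4.2); a = (1 - p/n)**2 gives d-hat_2 of (4.3).
    d2 is the sample Mahalanobis distance D^2 based on all p variables.
    """
    n = n1 + n2
    g2 = n1 * n2 / n
    lead = (n - 2 + g2 * d2) / ((n - p - 1) * g2)
    return lead * (a * (n * p) ** (1 / 3) + (n - p - 1) / (n - p - 3))
```

A value returned by d_hat would then be passed as the constant d to a selector such as dc_select in the Section 2 sketch.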

5. Numerical Study

In this section we numerically explore the validity of our claims through three distance-based criteria, DC_{d_0}, DC_{d̂_1} and DC_{d̂_2}. Here, d_0 is the midpoint of the interval [q(p, n, Δ^2), τ_min/(1 − c)] in (3.7). The true model was assumed as follows: the number of true variables is p∗ = 4, 8; the true mean vectors are μ^{(1)} = α(1, . . . , 1, 0, . . . , 0)' and μ^{(2)} = −α(1, . . . , 1, 0, . . . , 0)' with p∗ ones; and the true covariance matrix is Σ = I_p. The selection rates associated with these criteria, based on 10^3 replications, are given in Tables 5.1 to 5.4, where "Under", "True" and "Over" denote selection of an underspecified model, the true model, and an overspecified model, respectively.

Table 5.1. Selection rates of DC_{d_0}, DC_{d̂_1} and DC_{d̂_2} for p∗ = 4 and α = 1

p∗ = 4, α = 1
                    d_0                  d̂_1                  d̂_2
 n_1  n_2    p   Under True  Over    Under True  Over    Under True  Over
  50   50   10   0.34  0.65  0.01    0.44  0.56  0.00    0.25  0.75  0.01
 100  100   10   0.11  0.89  0.00    0.00  1.00  0.00    0.00  0.99  0.01
 200  200   10   0.01  0.99  0.00    0.00  1.00  0.00    0.00  1.00  0.00
  50   50   25   0.39  0.51  0.10    0.95  0.05  0.00    0.45  0.52  0.03
 100  100   50   0.19  0.81  0.01    0.63  0.37  0.00    0.06  0.94  0.00
 200  200  100   0.03  0.97  0.00    0.06  0.94  0.00    0.00  1.00  0.00
  50   50   50   0.55  0.17  0.29    1.00  0.00  0.00    0.54  0.24  0.22
 100  100  100   0.28  0.59  0.13    1.00  0.00  0.00    0.10  0.61  0.29
 200  200  200   0.09  0.91  0.00    0.99  0.01  0.00    0.00  0.92  0.08

Table 5.2. Selection rates of DC_{d_0}, DC_{d̂_1} and DC_{d̂_2} for p∗ = 4 and α = 10

p∗ = 4, α = 10
                    d_0                  d̂_1                  d̂_2
 n_1  n_2    p   Under True  Over    Under True  Over    Under True  Over
  50   50   10   0.28  0.71  0.01    0.18  0.82  0.00    0.10  0.90  0.01
 100  100   10   0.07  0.93  0.00    0.00  1.00  0.00    0.00  1.00  0.00
 200  200   10   0.00  1.00  0.00    0.00  1.00  0.00    0.00  1.00  0.00
  50   50   25   0.34  0.62  0.04    0.76  0.24  0.00    0.19  0.78  0.03
 100  100   50   0.10  0.90  0.00    0.18  0.82  0.00    0.01  0.98  0.01
 200  200  100   0.01  0.99  0.00    0.00  1.00  0.00    0.00  1.00  0.00
  50   50   50   0.49  0.29  0.22    1.00  0.00  0.00    0.30  0.36  0.34
 100  100  100   0.22  0.72  0.06    1.00  0.01  0.00    0.02  0.69  0.29
 200  200  200   0.05  0.95  0.00    0.81  0.19  0.00    0.00  0.90  0.10

Table 5.3. Selection rates of DC_{d_0}, DC_{d̂_1} and DC_{d̂_2} for p∗ = 8 and α = 1

p∗ = 8, α = 1
                    d_0                  d̂_1                  d̂_2
 n_1  n_2    p   Under True  Over    Under True  Over    Under True  Over
  50   50   10   0.81  0.18  0.01    1.00  0.00  0.00    1.00  0.00  0.00
 100  100   10   0.52  0.48  0.00    0.79  0.21  0.00    0.65  0.35  0.00
 200  200   10   0.20  0.80  0.00    0.03  0.97  0.00    0.02  0.98  0.00
  50   50   25   0.86  0.05  0.09    1.00  0.00  0.00    1.00  0.00  0.00
 100  100   50   0.68  0.26  0.06    1.00  0.00  0.00    0.97  0.03  0.00
 200  200  100   0.30  0.69  0.01    1.00  0.00  0.00    0.53  0.47  0.00
  50   50   50   0.93  0.00  0.07    1.00  0.00  0.00    1.00  0.00  0.00
 100  100  100   0.79  0.05  0.16    1.00  0.00  0.00    0.96  0.02  0.02
 200  200  200   0.52  0.40  0.09    1.00  0.00  0.00    0.53  0.43  0.04

Table 5.4. Selection rates of DC_{d_0}, DC_{d̂_1} and DC_{d̂_2} for p∗ = 8 and α = 10

p∗ = 8, α = 10
                    d_0                  d̂_1                  d̂_2
 n_1  n_2    p   Under True  Over    Under True  Over    Under True  Over
  50   50   10   0.79  0.19  0.02    1.00  0.00  0.00    0.99  0.01  0.00
 100  100   10   0.48  0.52  0.00    0.61  0.40  0.00    0.48  0.52  0.00
 200  200   10   0.14  0.86  0.00    0.00  1.00  0.00    0.00  1.00  0.00
  50   50   25   0.84  0.08  0.08    1.00  0.00  0.00    1.00  0.00  0.00
 100  100   50   0.60  0.33  0.06    1.00  0.00  0.00    0.91  0.09  0.00
 200  200  100   0.25  0.75  0.00    1.00  0.00  0.00    0.27  0.73  0.00
  50   50   50   0.91  0.00  0.09    1.00  0.00  0.00    0.99  0.00  0.01
 100  100  100   0.74  0.08  0.18    1.00  0.00  0.00    0.89  0.06  0.04
 200  200  200   0.45  0.49  0.06    1.00  0.00  0.00    0.31  0.63  0.06

From Tables 5.1 to 5.4, we can identify the following tendencies.

- For all the DC criteria with d_0, d̂_1 and d̂_2, there are cases in which the selection rates are high. In particular, when the dimension p is small, as in p = 10, the selection rates are near 1. However, when n and p are close and not large, the selection rates are not near 1. On the other hand, even in the cases for which consistency cannot be observed in the tables, the criteria are expected to become consistent as n and p grow.
- For all the DC criteria with d_0, d̂_1 and d̂_2, the selection rates of the true model increase as n and p become large.
- For all the DC criteria with d_0, d̂_1 and d̂_2, the selection rates of the true model decrease as n and p become closer.
- For all the DC criteria with d_0, d̂_1 and d̂_2, the selection rates of the true model decrease as the dimension p increases.
- For all the DC criteria with d_0, d̂_1 and d̂_2, the selection rates of the true model increase as the distance between the groups, i.e., α, increases.
- The DC criterion with d_0 tends not to select overspecified models when p is small in comparison with n, and tends to select overspecified models when p and n are close.
- The DC criterion with d̂_1 does not select overspecified models, and tends to choose variables strictly.
- The DC criterion with d̂_2 tends not to select overspecified models when p is small, and tends to select overspecified models when n and p are close, as with d_0.
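A sketch of the Monte Carlo design described above (our own illustration, reusing dc_select from the Section 2 sketch; the constant d is supplied externally, e.g. the midpoint d_0 from the d_interval sketch or a value of d_hat):

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_rates(n1, n2, p, p_star, alpha, d, reps=1000):
    """Under/True/Over selection rates of DC_d for the true model of Section 5:
    mu^(1) = alpha*(1,...,1,0,...,0)', mu^(2) = -mu^(1), Sigma = I_p."""
    mu = np.zeros(p)
    mu[:p_star] = alpha
    true_set = list(range(p_star))
    under = true = over = 0
    for _ in range(reps):
        x1 = rng.standard_normal((n1, p)) + mu
        x2 = rng.standard_normal((n2, p)) - mu
        j = dc_select(x1, x2, d)          # sketch from Section 2
        if j == true_set:
            true += 1
        elif set(j) > set(true_set):      # strict superset: overspecified
            over += 1
        else:                             # misses a true variable: underspecified
            under += 1
    return under / reps, true / reps, over / reps
```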

6. Concluding Remarks

In this paper we proposed a distance-based criterion (DC) for the variable selection problem, based on the contribution of each variable after removing the effects of the other variables. The criterion involves a constant term d and is denoted by DC_d. DC_d need not examine all 2^p subsets; it examines only the p subsets (−i) = ω − {i}, i = 1, . . . , p. This circumvents the computational complexity associated with all-subsets selection procedures. Further, we identified a range of d, and a class of estimators of d, for which DC_d has a high-dimensional consistency property. In particular, the criteria DC_d with d = d_0, d = d̂_1 and d = d̂_2 were numerically examined. However, an optimality problem for d is left for future research. A study of high-dimensional consistency properties when p∗ is not finite and Δ^2 = O(p) is also left for future research.

Appendix: Proofs of Theorems 3.1, 3.2 and 4.1

A1. Preliminary Lemmas

First we summarize distributional results related to the Mahalanobis distance and the statistics D_{d,i} in (2.6). For notational simplicity, consider a decomposition of y = (y_1', y_2')', where y_1 is p_1 × 1, y_2 is p_2 × 1, and p = p_1 + p_2.

Let D and D_1 be the sample Mahalanobis distances based on y and y_1, respectively, and let D^2_{2·1} = D^2 − D_1^2. Similarly, the corresponding population quantities are denoted by Δ, Δ_1 and Δ^2_{2·1}. Let the coefficient vector β of the linear discriminant function in (2.1) be decomposed as β = (β_1', β_2')', where β_1 is p_1 × 1 and β_2 is p_2 × 1. The following lemmas (see, e.g., Fujikoshi et al. 2010) are used.

Lemma A.1. For the population Mahalanobis distances Δ and Δ_1, it holds that
(1) Δ^2_{2·1} = Δ^2 − Δ_1^2 ≥ 0;
(2) Δ^2_{2·1} = 0 ⟺ β_2 = 0.

Note that

Δ^2 − Δ^2_{(−i)} = 0 ⟺ β_i = 0.   (A.1)

Lemma A.2. For the sample Mahalanobis distances D and D_1, it holds that
(1) D^2 = (n − 2) g^{−2} χ^2_p(g^2 Δ^2) {χ^2_{n−p−1}}^{−1};
(2) if Δ^2_{2·1} = 0, then g^2(D^2 − D_1^2){n − 2 + g^2 D_1^2}^{−1} = χ^2_{p_2} {χ^2_{n−p−1}}^{−1};
(3) if Δ^2_{2·1} = 0, then (D^2 − D_1^2){n − 2 + g^2 D_1^2}^{−1} and D_1^2 are independent;
(4) D^2_{2·1} = (n − 2) g^{−2} χ^2_{p_2}(g^2 Δ^2_{2·1}(1 + R)^{−1}) {χ^2_{n−p−1}}^{−1} (1 + R), where R = χ^2_{p_1}(g^2 Δ_1^2) {χ^2_{n−p_1−1}}^{−1}.

Here χ^2_{p_1}(·), χ^2_{n−p_1−1}, χ^2_{p_2}(·) and χ^2_{n−p−1} are independent chi-square variates, the arguments in parentheses denoting noncentrality parameters.

From Lemma A.2, D^2 and D^2_{(−i)} can be expressed as

D^2 = (n − 2) g^{−2} χ^2_p(g^2 Δ^2)/χ^2_{n−p−1}, D^2_{(−i)} = (n − 2) g^{−2} χ^2_{p−1}(g^2 Δ^2_{(−i)})/χ^2_{n−p},   (A.2)

where g = (n_1 n_2/n)^{1/2}. It is easy to see that

(1/m) χ^2_m →_p 1, (1/m) χ^2_m(δ^2) →_p 1 + η^2, if η^2 = lim_{m→∞} δ^2/m.   (A.3)
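As a quick numerical sanity check of Lemma A.2 (1) (a sketch under arbitrary illustrative parameter values of our own choosing): the mean of the right-hand representation should match E(D^2) = (n − 2)(p g^{−2} + Δ^2)/(n − p − 3), the expression used in the proof of Theorem 3.2 below.

```python
import numpy as np

rng = np.random.default_rng(1)

n1 = n2 = 50; p = 10                       # illustrative sample sizes and dimension
n = n1 + n2
g2 = n1 * n2 / n
delta2 = 1.0                               # illustrative value of Delta^2
reps = 200_000
# (n-2) g^{-2} chi^2_p(g^2 Delta^2) / chi^2_{n-p-1}, as in Lemma A.2 (1)
rep = ((n - 2) / g2) * rng.noncentral_chisquare(p, g2 * delta2, reps) \
      / rng.chisquare(n - p - 1, reps)
print(rep.mean())                                   # simulated E(D^2)
print((n - 2) * (p / g2 + delta2) / (n - p - 3))    # closed form
```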

A2. Proofs of Theorems 3.1 and 3.2

Without loss of generality, we may assume j∗ = {1, . . . , s}, so that p∗ = s. Then [F1] and [F2] are expressed as

[F1] = Σ_{i=1}^{s} P(D_{d,i} ≤ 0), [F2] = Σ_{i=s+1}^{p} P(D_{d,i} ≥ 0).

Proof of Theorem 3.1. Using (A.2) and (A.3),

D^2 = (n − 2) g^{−2} χ^2_p(g^2 Δ^2)/χ^2_{n−p−1} →_p (1 − c)^{−1}(c k_1^{−1} k_2^{−1} + Δ^2).

Similarly,

D^2_{(−i)} →_p (1 − c)^{−1}(c k_1^{−1} k_2^{−1} + Δ^2_{(−i)}).

Therefore,

D^2 − D^2_{(−i)} →_p (1 − c)^{−1}(Δ^2 − Δ^2_{(−i)}).

If i ∈ j∗ = {1, . . . , s}, then τ_i = Δ^2 − Δ^2_{(−i)} > 0, i = 1, . . . , s, and D^2 − D^2_{(−i)} →_p (1 − c)^{−1} τ_i. Therefore, if d < (1 − c)^{−1} τ_min, where τ_min = min_{i=1,...,s} τ_i, the limit in probability of D_{d,i} = D^2 − D^2_{(−i)} − d is positive, so that P(D_{d,i} ≤ 0) → 0 and

[F1] = Σ_{i=1}^{s} P(D_{d,i} ≤ 0) → 0,

since s is finite.

Proof of Theorem 3.2. Next we show [F2] under the assumptions of Theorem 3.2. Assume that i ∈ {s + 1, . . . , p}. Then Δ^2 = Δ^2_{(−i)}. In general, note that D^2 − D^2_{(−i)} > 0. Using Lemma A.2 (1),

E(D^2) = (n − 2)np/(n_1 n_2 (n − p − 3)) + (n − 2)Δ^2/(n − p − 3),

and hence

E{D^2 − D^2_{(−i)}} = q(p, n, Δ^2),

where q(p, n, Δ^2) is given by (3.5). Suppose that d > q(p, n, Δ^2). Then we have

P(D^2 − D^2_{(−i)} ≥ d)
= P(D^2 − D^2_{(−i)} − E[D^2 − D^2_{(−i)}] ≥ d − E[D^2 − D^2_{(−i)}])
≤ Var(D^2 − D^2_{(−i)}) / (d − E[D^2 − D^2_{(−i)}])^2.

The last inequality follows from Chebyshev's inequality. It is easy to see that E[D^2 − D^2_{(−i)}] = O(n^{−1}). Since d = O(1) by assumption, (d − E[D^2 − D^2_{(−i)}])^2 = O(1). Using Lemma A.2 (4), the variance is expressed as

Var(D^2 − D^2_{(−i)}) = E{(D^2 − D^2_{(−i)})^2} − {E(D^2 − D^2_{(−i)})}^2
= (n − 2)^2 g^{−4} E{(χ^2_1/χ^2_{n−p−1})^2 (1 + χ^2_{p−1}(g^2 Δ^2)/χ^2_{n−p})^2}
  − {(n − 2)^2 g^{−4}/(n − p − 3)^2} (1 + (p − 1 + g^2 Δ^2)/(n − p − 2))^2.

The expectation in the first term can be computed as

E{(χ^2_1/χ^2_{n−p−1})^2 (1 + χ^2_{p−1}(g^2 Δ^2)/χ^2_{n−p})^2}
= {3/((n − p − 3)(n − p − 5))} E{1 + 2 χ^2_{p−1}(g^2 Δ^2)/χ^2_{n−p} + (χ^2_{p−1}(g^2 Δ^2)/χ^2_{n−p})^2}
= {3/((n − p − 3)(n − p − 5))} {1 + 2(p − 1 + g^2 Δ^2)/(n − p − 2)
  + ((p − 1)^2 + 2(p − 1) + 2(p − 1) g^2 Δ^2 + 4 g^2 Δ^2 + g^4 Δ^4)/((n − p − 2)(n − p − 4))}
= {3/((n − p − 3)(n − p − 5))} {1 + O(1) + O(1)} = O(n^{−2}).

The second term is evaluated as

{(n − 2)^2 g^{−4}/(n − p − 3)^2} (1 + (p − 1 + g^2 Δ^2)/(n − p − 2))^2 = {(n − 2)^2 g^{−4}/(n − p − 3)^2} O(1) = O(n^{−2}).

Therefore we have

Var(D^2 − D^2_{(−i)}) = (n − 2)^2 g^{−4} O(n^{−2}) − O(n^{−2}) = O(n^{−2}).

These imply that

P(D^2 − D^2_{(−i)} ≥ d) ≤ O(n^{−2}),

and hence

Σ_{i=s+1}^{p} P(D^2 − D^2_{(−i)} ≥ d) → 0,

which proves [F2] → 0.

A3. Proof of Theorem 4.1

As in the proofs of Theorems 3.1 and 3.2, assume that j∗ = {1, . . . , s}, and show that

[F1] = Σ_{i=1}^{s} P(D^2 − D^2_{(−i)} ≤ d̂) → 0,
[F2] = Σ_{i=s+1}^{p} P(D^2 − D^2_{(−i)} ≥ d̂) → 0.

First consider [F1], and assume that i ∈ {1, . . . , s}. In the proof of Theorem 3.1 it was shown that

D^2 →_p (1 − c)^{−1}(c k_1^{−1} k_2^{−1} + Δ^2).

Using this result, we have

d̂ →_p 0.

These imply that

D^2 − D^2_{(−i)} − d̂ →_p (1 − c)^{−1} τ_i > 0,

and hence

lim P(D^2 − D^2_{(−i)} ≤ d̂) = 0,

which implies [F1] → 0.

Next, we show [F2] → 0. Assume that i ∈ {s + 1, . . . , p}. Then Δ^2 = Δ^2_{(−i)}. Write D^2_{{i}·(−i)} = D^2 − D^2_{(−i)}. Using Lemma A.2 (2) and (3), we can write

P(D^2 − D^2_{(−i)} ≥ d̂)
= P(χ^2_1/χ^2_{n−p−1} ≥ g^2 d̂/(n − 2 + g^2 D^2_{(−i)}))
≤ P(χ^2_1/χ^2_{n−p−1} ≥ g^2 d̂/(n − 2 + g^2 D^2))
= P(F_{1,n−p−1} ≥ (n − p − 1)/(n − p − 3) + a(np)^{1/3}).

Noting that the mean of F_{1,n−p−1} is (n − p − 1)/(n − p − 3), we have

P(D^2 − D^2_{(−i)} ≥ d̂) ≤ P(F_{1,n−p−1} − E[F_{1,n−p−1}] ≥ a(np)^{1/3})
≤ Var(F_{1,n−p−1})/(a^2 (np)^{2/3})
= {1/(a^2 (np)^{2/3})} {2(n − p − 1)^2 (n − p − 2)/((n − p − 3)^2 (n − p − 5))}.

This implies that

[F2] ≤ Σ_{i=s+1}^{p} {2(n − p − 1)^2 (n − p − 2)/(a^2 (np)^{2/3} (n − p − 3)^2 (n − p − 5))}
     ≤ {p/(a^2 (np)^{2/3})} {2(n − p − 1)^2 (n − p − 2)/((n − p − 3)^2 (n − p − 5))} = O(n^{−1/3}),

which proves [F2] → 0.

Acknowledgements

The first author's research was partially supported by the Ministry of Education, Science, Sports, and Culture, through a Grant-in-Aid for Scientific Research (C), 16K00047, 2016–2018.

References

[1] Clemmensen, L., Hastie, T., Witten, D. M. and Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.

[2] Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike's information criteria. J. Multivariate Anal., 17, 27–37.

[3] Fujikoshi, Y., Ulyanov, V. V. and Shimizu, R. (2010). Multivariate Statistics: High-Dimensional and Large-Sample Approximations. Wiley, Hoboken, N.J.

[4] Hyodo, M. and Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. J. Multivariate Anal., 123, 364–379.

[5] Nishii, R., Bai, Z. D. and Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Math. J., 18, 451–462.

[6] Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York.

[7] Yamada, T., Sakurai, T. and Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W- and Z-rules. Hiroshima Statistical Research Group, TR 17-1.

[8] Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. J. R. Statist. Soc. B, 73, 753–772.