On a General Two-Sample Test with New Critical Value for Multivariate Feature Selection in Chest X-Ray Image Analysis Problem


Applied Mathematical Sciences, Vol. 9, 2015, no. 147, 7317-7325
HIKARI Ltd, www.m-hikari.com
http://dx.doi.org/10.12988/ams.2015.510687

On a General Two-Sample Test with New Critical Value for Multivariate Feature Selection in Chest X-Ray Image Analysis Problem

Samir B. Belhaouari, Hamada R. H. Al-Absi, Ramil F. Kuleev and Nasreddine Megrez

Innopolis University, Innopolis, Russia

Copyright © 2015 Samir B. Belhaouari, Hamada R. H. Al-Absi, Ramil F. Kuleev and Nasreddine Megrez. This article is distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In this paper we propose a two-sample test for the means of high-dimensional data, together with a new method to calculate the critical value. The proposed test does not require any condition linking the data dimension to the sample size, which makes it a good alternative to Hotelling's T² statistic when the data dimension is much larger than the sample size and/or the two sample covariance matrices are not equal. One of the most important applications of the proposed test is multivariate feature selection, particularly in fields where the data dimension is high, such as image features, gene expression, or financial data. The low computing time required by the proposed method to calculate the critical value is also worth highlighting.

Mathematics Subject Classification: 62H15, 60K35, 62G10, 62E20

Keywords: High Dimensional Two Sample Test, Feature Selection

1 Introduction

Feature selection is a very important task for reaching high accuracy in classification systems. It plays an especially important role in complex machine learning and computer vision problems such as medical image analysis ([14]). One-dimensional metrics measure the overlapping area between classes for a single feature, independently of the other features. In most applications, however, a feature does not act alone but jointly with other features, so the correct way to select the best features for classification is to measure the contribution of a set of features together rather than one feature at a time. This task calls for multidimensional metrics.

Let {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_m} be two independent i.i.d. random samples drawn from the p-dimensional multivariate normal distributions X ∼ N(μ_1, Σ_1) and Y ∼ N(μ_2, Σ_2), respectively, where the mean vectors μ_1, μ_2 ∈ R^p and the covariance matrices Σ_1, Σ_2 are positive definite. Suppose that μ_1, μ_2 and Σ_1, Σ_2 are unknown. In this paper we consider the problem of measuring how close the two Gaussian mean vectors μ_1 and μ_2 are to each other, which translates into testing the high-dimensional hypothesis

    H_0 : μ_1 = μ_2   versus   H_1 : μ_1 ≠ μ_2.

If X and Y are not Gaussian, it is enough to assume that n and m are large enough for the Central Limit Theorem to apply to the sample means. Using maximum likelihood methods, the mean vectors are estimated by the sample means,

    μ̂_1 = X̄ = (1/n) ∑_{i=1}^n x_i   and   μ̂_2 = Ȳ = (1/m) ∑_{i=1}^m y_i.

Hotelling's T² test ([6]) is the conventional test for the above hypothesis when the dimension p is fixed and smaller than n + m − 2, and Σ_1 = Σ_2. This test is defined as

    T² := (nm/(n + m)) (X̄ − Ȳ)^T Σ̂^{−1} (X̄ − Ȳ),

where Σ̂ is the pooled sample covariance matrix given by

    Σ̂ = (1/(n + m − 2)) [ ∑_{i=1}^n (x_i − X̄)(x_i − X̄)^T + ∑_{i=1}^m (y_i − Ȳ)(y_i − Ȳ)^T ]
       = ((n − 1) Σ̂_1 + (m − 1) Σ̂_2) / (n + m − 2).
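For comparison, Hotelling's statistic and its F calibration can be sketched in a few lines of NumPy/SciPy. This is an illustrative implementation, not code from the paper:

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    """Two-sample Hotelling T^2 test (assumes Sigma_1 = Sigma_2 and p < n+m-2).

    X: (n, p) sample from the first class, Y: (m, p) sample from the second.
    Returns the T^2 statistic and the p-value from its F reference distribution.
    """
    n, p = X.shape
    m, _ = Y.shape
    d = X.mean(axis=0) - Y.mean(axis=0)  # shift between the two sample means
    # Pooled covariance: ((n-1) S_1 + (m-1) S_2) / (n + m - 2)
    S = ((n - 1) * np.cov(X, rowvar=False) +
         (m - 1) * np.cov(Y, rowvar=False)) / (n + m - 2)
    t2 = n * m / (n + m) * d @ np.linalg.solve(S, d)
    # (n+m-p-1) / (p (n+m-2)) * T^2 ~ F_{p, n+m-p-1} under H0
    f = (n + m - p - 1) / (p * (n + m - 2)) * t2
    pval = stats.f.sf(f, p, n + m - p - 1)
    return t2, pval
```

When p approaches (or exceeds) n + m − 2, the pooled matrix becomes ill-conditioned (or singular) and `np.linalg.solve` fails or is unstable, which is exactly the regime the paper is concerned with.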

With the Central Limit Theorem, we have

    ((n + m − (p + 1)) / (p (n + m − 2))) T² ∼ F_{p, n+m−(p+1)}.

When p > n + m − 2, the matrix Σ̂ is singular and Hotelling's test is not defined. As demonstrated in [1], Hotelling's test is inefficient even when p ≤ n + m − 2, once p is nearly as large as n + m − 2. It is also important to highlight that the assumption Σ_1 = Σ_2 is hard to verify for high-dimensional data, so using Hotelling's T² test as in [8] may be misleading. Moreover, since the hypothesis H_0 consists of the p marginal hypotheses

    H_{0l} : μ_{1l} = μ_{2l},   l = 1, ..., p,

about the means in each data dimension, a natural question is how many hypotheses can be tested simultaneously. These problems were addressed in [4], [7], [3] and [12], but with limitations, complications, and high computational cost. In this paper we therefore provide a simple two-sample test that works in all cases, without any restriction on p and even when Σ_1 ≠ Σ_2. In the context of feature selection, this test estimates the overlapping area between classes.

2 Vector-Variate T statistical metric

Let δ := X̄ − Ȳ denote the shift vector between the two sample means. When H_0 holds (μ_1 = μ_2), the mean vector and the covariance matrix of δ are

    E[δ] = E[X̄ − Ȳ] = E[X̄] − E[Ȳ] = μ_1 − μ_2 = 0

and

    Σ_δ = E[δ δ^T] = E[(X̄ − Ȳ)(X̄ − Ȳ)^T] = Σ_X̄ + Σ_Ȳ.

Since the values of the two samples are independent, the covariance matrices Σ_X̄ and Σ_Ȳ can be evaluated as

    Σ_X̄ = Σ_1 / n   and   Σ_Ȳ = Σ_2 / m.

Hence

    Σ_δ = Σ_1 / n + Σ_2 / m.

Thus, for n and m big enough,

    δ ∼ N(0, Σ_1 / n + Σ_2 / m),

and we define our Vector-Variate T statistical metric as

    Z := Σ_δ^{−1/2} δ = (Σ_1 / n + Σ_2 / m)^{−1/2} δ ∼ N(0, I),

where I is the identity matrix. The covariance matrices Σ_1 and Σ_2 are square (p × p) and positive definite, so Σ_δ is square positive definite and therefore orthogonally diagonalizable. Let Λ := diag(√λ_1, √λ_2, ..., √λ_p) and θ := (ϑ_1, ϑ_2, ..., ϑ_p) be the matrices formed by the square roots of the eigenvalues of Σ_δ and by the corresponding eigenvectors, respectively. Then Σ_δ can be written as

    Σ_δ = (θ^{−1} Λ θ)²,

and our vector-variate metric Z can be rewritten as

    Z := (θ^{−1} Λ^{−1} θ) δ.

2.1 The critical value z_α

The chosen critical value fixes a trade-off between the risk of rejecting H_0 when H_0 actually holds and the risk of accepting H_0 when H_1 holds. For a significance level α, we reject H_0 in favor of H_1 if

    ‖Z‖² = z_1² + z_2² + ... + z_p² > z_α²,

where α = P(z_1² + z_2² + ... + z_p² > z_α²). If p = 2 we can take z_α = √(−2 ln α). For higher dimensions p > 2, writing the probability in spherical coordinates gives

    α = ∫_0^{2π} dϕ ∫_0^π sin θ_1 dθ_1 ⋯ ∫_0^π sin^{p−2} θ_{p−2} dθ_{p−2} ∫_{z_α}^∞ (r^{p−1} / (2π)^{p/2}) e^{−r²/2} dr
      = (1 / (2^{p/2−1} Γ(p/2))) ∫_{z_α}^∞ r^{p−1} e^{−r²/2} dr.

Let Φ(x) be the cumulative distribution function of the standard Gaussian variable. We define the sequence

    I(n, x) = ∫_x^∞ r^n e^{−r²/2} dr,

with the initial values

    I(0, x) = √(2π) (1 − Φ(x))   and   I(1, x) = e^{−x²/2}.

Using integration by parts, we derive the following recurrence formula for n > 1:

    I(n, x) = x^{n−1} e^{−x²/2} + (n − 1) I(n − 2, x),

so that I(p − 1, z_α) reduces to I(1, ·) when p is even and to I(0, ·) when p is odd, and z_α is obtained by solving α = I(p − 1, z_α) / (2^{p/2−1} Γ(p/2)) numerically.
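Evaluating z_α with this recurrence is cheap. The sketch below is ours, not the authors' code; it assumes the reading that the test rejects when ‖Z‖ > z_α, so that α = I(p−1, z_α) / (2^{p/2−1} Γ(p/2)) (for p = 2 this gives the closed form z_α = √(−2 ln α)), and it finds z_α by bisection since the tail probability is decreasing in z:

```python
import math
from statistics import NormalDist

def I(n, x):
    """I(n, x) = integral of r^n e^{-r^2/2} over [x, inf), via the recurrence."""
    if n == 0:
        return math.sqrt(2 * math.pi) * (1 - NormalDist().cdf(x))
    if n == 1:
        return math.exp(-x * x / 2)
    # Integration by parts: I(n, x) = x^{n-1} e^{-x^2/2} + (n-1) I(n-2, x)
    return x ** (n - 1) * math.exp(-x * x / 2) + (n - 1) * I(n - 2, x)

def tail_prob(z, p):
    """P(||Z|| > z) for Z ~ N(0, I_p): I(p-1, z) / (2^{p/2-1} Gamma(p/2))."""
    return I(p - 1, z) / (2 ** (p / 2 - 1) * math.gamma(p / 2))

def critical_value(alpha, p, lo=0.0, hi=100.0, tol=1e-10):
    """Solve tail_prob(z_alpha, p) = alpha by bisection (tail_prob decreases in z)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tail_prob(mid, p) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As a sanity check, for p = 1 the formula reduces to tail_prob(z, 1) = 2(1 − Φ(z)), the two-sided standard normal tail.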

2.2 Parameter Estimation

The covariance matrix needs to be estimated if it is unknown. The unbiased estimator is the sample covariance matrix

    Σ̂_δ = Σ̂_1 / n + Σ̂_2 / m = ∑_{i=1}^n (x_i − X̄)(x_i − X̄)^T / (n(n − 1)) + ∑_{i=1}^m (y_i − Ȳ)(y_i − Ȳ)^T / (m(m − 1)).

This estimate must be checked and corrected if it has a negative eigenvalue. In practice the negative eigenvalues are generally very close to zero, so it is enough to replace all of them by a small positive number ε, chosen at least 10 times smaller than the smallest positive eigenvalue of the covariance matrix. For example, if the first eigenvalue is negative, the corrected covariance estimate equals

    Σ̂_δ = θ^{−1} diag(ε, λ_2, ..., λ_p) θ,

where θ is the matrix of eigenvectors of the estimated covariance matrix and ε < min_{k ∈ {2,...,p}} λ_k / 10.

3 Simulation

3.1 Feature selection

For simulation purposes the public JSRT database ([10]) was used. This standard digital image database of chest radiographs with and without lung nodules was created by the Japanese Society of Radiological Technology (JSRT) in cooperation with the Japanese Radiological Society (JRS) in 1998. The database contains 154 nodule and 93 non-nodule images with a resolution of 2048 × 2048 pixels. Symlet wavelets at decomposition level 16 were applied to all images in order to extract features. Since wavelet-based feature extraction yields a large number of features, a multi-feature selection step is applied to reduce it. The objective is to choose the most significant set of features for classification, i.e. the combination of features that best distinguishes the object (pathology) class from the non-object (normal) class. To accomplish this, a two-step feature selection process is performed.
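The estimation-and-correction step of subsection 2.2 can be sketched as follows. This is our illustration and the function names are ours; negative eigenvalues of the estimate are replaced by ε, one tenth of the smallest positive eigenvalue:

```python
import numpy as np

def estimate_sigma_delta(X, Y):
    """Unbiased estimate of Cov(delta) = Sigma_1/n + Sigma_2/m from the samples."""
    n, m = X.shape[0], Y.shape[0]
    return np.cov(X, rowvar=False) / n + np.cov(Y, rowvar=False) / m

def fix_negative_eigenvalues(S, factor=10.0):
    """Replace any non-positive eigenvalues of the symmetric matrix S by a small
    positive epsilon (`factor` times smaller than the smallest positive
    eigenvalue), then rebuild S from the corrected spectrum.
    Assumes at least one positive eigenvalue, as in the paper's setting."""
    lam, V = np.linalg.eigh(S)          # S = V diag(lam) V^T, V orthogonal
    eps = lam[lam > 0].min() / factor
    lam = np.where(lam <= 0, eps, lam)
    return V @ np.diag(lam) @ V.T
```

The corrected matrix stays symmetric and becomes positive definite, so its inverse square root, needed for Z, is well defined.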
The proposed statistical metric was used to select the best features for classifying image regions into the two classes. The results of our simulation are presented in Table 1.
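The selection loop itself is not spelled out in the paper. The following sketch is entirely ours, under the assumption that candidate feature subsets are scored by ‖Z‖ computed on those coordinates (with the eigenvalue correction of subsection 2.2 applied inline); the names `score_subset` and `rank_subsets` are hypothetical:

```python
import numpy as np

def score_subset(X, Y, idx):
    """||Z|| for the feature subset idx: larger values mean the class means are
    further apart in these coordinates, i.e. less overlap between classes."""
    Xs, Ys = X[:, idx], Y[:, idx]
    n, m = Xs.shape[0], Ys.shape[0]
    delta = Xs.mean(axis=0) - Ys.mean(axis=0)
    S = np.cov(Xs, rowvar=False) / n + np.cov(Ys, rowvar=False) / m
    S = np.atleast_2d(S)                     # np.cov squeezes 1-feature subsets
    lam, V = np.linalg.eigh(S)
    lam = np.where(lam <= 0, lam[lam > 0].min() / 10, lam)  # eigenvalue fix
    z = V @ np.diag(lam ** -0.5) @ V.T @ delta  # Z = Sigma_delta^{-1/2} delta
    return float(np.linalg.norm(z))

def rank_subsets(X, Y, subsets):
    """Order candidate feature subsets from best to worst by the metric."""
    return sorted(subsets, key=lambda idx: score_subset(X, Y, list(idx)),
                  reverse=True)
```

Subsets whose coordinates separate the class means well receive large scores and are ranked first.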

Table 1: Dependence of the accuracy on the number of features

    Number of features   Accuracy   False Negative   False Positive
    1                    0.8318     0.2424           0.1351
    21                   0.9159     0.1515           0.0541
    41                   0.9533     0.0303           0.0541
    61                   0.972      0.0303           0.027
    81                   0.972      0                0.0405
    101                  0.972      0                0.0405
    121                  0.9907     0                0.0135
    141                  0.9813     0.0303           0.0135
    161                  0.9907     0.0303           0
    181                  0.9907     0.0303           0

4 Conclusion

We have proposed a simple two-sample test that works in all cases, without any limitation on p and even when Σ_1 ≠ Σ_2. The test was conceived particularly for multivariate feature selection problems such as medical image analysis. Owing to the complexity of such problems, each feature does not work individually but tends to work together with other features to achieve certain tasks. Figure 1 illustrates the inefficiency of a one-dimensional metric and the efficiency of a two-dimensional metric. In the context of multivariate feature selection, this test estimates the overlapping area between classes. More precisely, the metric quantifies the contribution of each set of q features selected among the p available features. The method proposed in subsection 2.1 for calculating the critical value z_α requires only a reasonable computing time. According to the value of the metric, the C(p, q) possible sets of q features can be ordered by their relevance for the classifier. One way to order the groups of features from the "best group" to the "worst group" is to compare the normalized value of the metric, ‖Z‖ / ‖ν‖², where ν = X̄ + Ȳ.

Acknowledgements. This work has been supported by the Russian Ministry of Education and Science through the project "Development of new perspective methods and algorithms for automatic recognition of thorax pathologies based on X-ray images" (agreement: 14.606.21.0002, ID: RFMEF160614X0002) and was carried out at Innopolis University.

Figure 1: Projection onto the x-axis or the y-axis shows a significant overlapping area between the two classes (+ and *); according to one-dimensional metrics, the two features in this example are therefore not significant for classification. A two-dimensional metric, on the other hand, is able to quantify the real overlapping area, i.e. the two features are in fact significant for classification.

References

[1] Z. Bai, H. Saranadasa, Effect of high dimension: by an example of a two sample problem, Statistica Sinica, 6 (1996), 311-329.

[2] L. Baringhaus, C. Franz, On a new multivariate two-sample test, Journal of Multivariate Analysis, 88 (2004), 190-206. http://dx.doi.org/10.1016/s0047-259x(03)00079-4

[3] S. X. Chen, Y. Qin, A two-sample test for high-dimensional data with applications to gene-set testing, The Annals of Statistics, 38 (2010), no. 2, 808-835. http://dx.doi.org/10.1214/09-aos716

[4] J. Fan, P. Hall and Q. Yao, To how many simultaneous hypothesis tests can normal, Student's t or bootstrap calibration be applied?, J. Amer. Statist. Assoc., 102 (2007), 1282-1288. http://dx.doi.org/10.1198/016214507000000969

[5] M. Han, X. Liu, Feature selection techniques with class separability for multivariate time series, Neurocomputing, 110 (2013), 29-34. http://dx.doi.org/10.1016/j.neucom.2012.12.006

[6] H. Hotelling, The generalization of Student's ratio, Annals of Mathematical Statistics, 2 (1931), no. 3, 360-378. http://dx.doi.org/10.1214/aoms/1177732979

[7] M. R. Kosorok, S. Ma, Marginal asymptotics for the "large p, small n" paradigm: With applications to microarray data, Ann. Statist., 35 (2007), 1456-1486. http://dx.doi.org/10.1214/009053606000001433

[8] Y. Lu, P. Y. Liu, P. Xiao, and H. W. Deng, Hotelling's T2 multivariate profiling for detecting differential expression in microarrays, Bioinformatics, 21 (2005), no. 14, 3105-3113. http://dx.doi.org/10.1093/bioinformatics/bti496

[9] M. F. Schilling, Multivariate Two-Sample Tests Based on Nearest Neighbors, Journal of the American Statistical Association, 81 (1986), no. 395, 799-806. http://dx.doi.org/10.2307/2289012

[10] J. Shiraishi, S. Katsuragawa, J. Ikezoe, T. Matsumoto, T. Kobayashi, K. Komatsu, M. Matsui, H. Fujita, Y. Kodera and K. Doi, Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists' detection of pulmonary nodules, American Journal of Roentgenology, 174 (2000), no. 1, 71-74. http://dx.doi.org/10.2214/ajr.174.1.1740071

[11] M. S. Srivastava, S. Katayama, Y. Kano, A two sample test in high dimensional data, Journal of Multivariate Analysis, 114 (2013), 349-358. http://dx.doi.org/10.1016/j.jmva.2012.08.014

[12] M. Van der Laan, J. Bryan, Gene expression analysis with the parametric bootstrap, Biostatistics, 2 (2001), 445-461. http://dx.doi.org/10.1093/biostatistics/2.4.445

[13] L. Xu, Matrix-Variate Discriminative Analysis, Integrative Hypothesis Testing, and Geno-Pheno A5 Analyzer, Lecture Notes in Computer Science, 7751 (2013), 866-875. http://dx.doi.org/10.1007/978-3-642-36669-7_105

[14] A. N. Zakirov, R. F. Kuleev, A. S. Timoshenko and A. V. Vladimirov, Advanced Approaches to Computer-Aided Detection of Thoracic Diseases on Chest X-Rays, Applied Mathematical Sciences, 9 (2015), no. 88, 4361-4369. http://dx.doi.org/10.12988/ams.2015.54348

[15] N. Zhou, L. Wang, A Modified T-test Feature Selection Method and Its Application on the HapMap Genotype Data, Genomics, Proteomics and Bioinformatics, 5 (2007), no. 3-4, 242-249. http://dx.doi.org/10.1016/s1672-0229(08)60011-x

Received: November 15, 2015; Published: December 17, 2015