On a General Two-Sample Test with New Critical Value for Multivariate Feature Selection in Chest X-Ray Image Analysis Problem


Applied Mathematical Sciences, Vol. 9, 2015, no. 147, 7317-7325
HIKARI Ltd, www.m-hikari.com
http://dx.doi.org/10.12988/ams.2015.510687

On a General Two-Sample Test with New Critical Value for Multivariate Feature Selection in Chest X-Ray Image Analysis Problem

Samir B. Belhaouari, Hamada R. H. Al-Absi, Ramil F. Kuleev and Nasreddine Megrez

Innopolis University, Innopolis, Russia

Copyright © 2015 Samir B. Belhaouari, Hamada R. H. Al-Absi, Ramil F. Kuleev and Nasreddine Megrez. This article is distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In this paper we propose a two-sample test for the means of high-dimensional data, together with a new method to calculate the critical value. The proposed test does not require any condition linking the data dimension to the sample size, which makes it a good alternative to Hotelling's T² statistic when the data dimension is much larger than the sample size and/or the two sample covariance matrices are not equal. One of the most important applications of the proposed test is multivariate feature selection, particularly in fields where the data dimension is high, such as image features, gene expression, or financial data. The low computing time required by the proposed method to calculate the critical value is also worth highlighting.

Mathematics Subject Classification: 62H15, 60K35, 62G10, 62E20

Keywords: High Dimensional Two Sample Test, Feature Selection

1 Introduction

Feature selection is a very important task for reaching high accuracy in classification systems. It plays an especially important role in complex machine learning and computer vision problems such as medical image analysis ([14]). One-dimensional metrics measure the overlapping area between classes for a single feature, independently of the other features. In most applications, however, a feature does not act alone but jointly with other features, so the correct way to select the best features for classification is to measure the contribution of a set of features together rather than one feature at a time. This task calls for multidimensional metrics.

Let {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_m} be two independent i.i.d. random samples drawn from the p-dimensional multivariate normal distributions X ∼ N(μ_1, Σ_1) and Y ∼ N(μ_2, Σ_2), respectively, where the mean vectors μ_1, μ_2 ∈ R^p and the covariance matrices Σ_1, Σ_2 are positive definite. Suppose that μ_1, μ_2 and Σ_1, Σ_2 are unknown. In this paper we consider the problem of measuring how close the two Gaussian mean vectors μ_1 and μ_2 are to each other, which translates into testing the high-dimensional hypothesis

    H_0 : μ_1 = μ_2   versus   H_1 : μ_1 ≠ μ_2.

If X and Y are not Gaussian, it is enough to assume that n and m are large enough for the Central Limit Theorem to apply to the sample means. Using maximum likelihood methods, the mean vectors are estimated by the sample means,

    μ̂_1 = X̄ = (1/n) ∑_{i=1}^n x_i   and   μ̂_2 = Ȳ = (1/m) ∑_{i=1}^m y_i.

Hotelling's T² test ([6]) is the conventional test for the above hypothesis when the dimension p is fixed and smaller than n + m − 2, and Σ_1 = Σ_2. This test is defined as

    T² := (nm/(n + m)) (X̄ − Ȳ)^T Σ̂^{−1} (X̄ − Ȳ),

where Σ̂ is the pooled sample covariance matrix given by

    Σ̂ = (1/(n + m − 2)) [ ∑_{i=1}^n (x_i − X̄)(x_i − X̄)^T + ∑_{i=1}^m (y_i − Ȳ)(y_i − Ȳ)^T ]
       = ((n − 1) Σ̂_1 + (m − 1) Σ̂_2) / (n + m − 2).
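For comparison, Hotelling's statistic and its F calibration can be sketched in a few lines of NumPy/SciPy. This is an illustrative implementation, not code from the paper:

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    """Two-sample Hotelling T^2 test (assumes Sigma_1 = Sigma_2 and p < n+m-2).

    X: (n, p) sample from the first class, Y: (m, p) sample from the second.
    Returns the T^2 statistic and the p-value from its F reference distribution.
    """
    n, p = X.shape
    m, _ = Y.shape
    d = X.mean(axis=0) - Y.mean(axis=0)  # shift between the two sample means
    # Pooled covariance: ((n-1) S_1 + (m-1) S_2) / (n + m - 2)
    S = ((n - 1) * np.cov(X, rowvar=False) +
         (m - 1) * np.cov(Y, rowvar=False)) / (n + m - 2)
    t2 = n * m / (n + m) * d @ np.linalg.solve(S, d)
    # (n+m-p-1) / (p (n+m-2)) * T^2 ~ F_{p, n+m-p-1} under H0
    f = (n + m - p - 1) / (p * (n + m - 2)) * t2
    pval = stats.f.sf(f, p, n + m - p - 1)
    return t2, pval
```

When p approaches (or exceeds) n + m − 2, the pooled matrix becomes ill-conditioned (or singular) and `np.linalg.solve` fails or is unstable, which is exactly the regime the paper is concerned with.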

With the Central Limit Theorem, we have

    ((n + m − (p + 1)) / (p (n + m − 2))) T² ∼ F_{p, n+m−(p+1)}.

When p > n + m − 2, the matrix Σ̂ is singular and Hotelling's test is not defined. As demonstrated in [1], Hotelling's test is inefficient even when p ≤ n + m − 2, once p is nearly as large as n + m − 2. It is also important to highlight that the assumption Σ_1 = Σ_2 is hard to verify for high-dimensional data, so using Hotelling's T² test as in [8] may be misleading. Moreover, since the hypothesis H_0 consists of the p marginal hypotheses

    H_{0l} : μ_{1l} = μ_{2l},   l = 1, ..., p,

about the means in each data dimension, a natural question is how many hypotheses can be tested simultaneously. These problems were addressed in [4], [7], [3] and [12], but with limitations, complications, and high computational cost. In this paper we therefore provide a simple two-sample test that works in all cases, without any restriction on p and even when Σ_1 ≠ Σ_2. In the context of feature selection, this test estimates the overlapping area between classes.

2 Vector-Variate T statistical metric

Let δ := X̄ − Ȳ denote the shift vector between the two sample means. When H_0 holds (μ_1 = μ_2), the mean vector and the covariance matrix of δ are

    E[δ] = E[X̄ − Ȳ] = E[X̄] − E[Ȳ] = μ_1 − μ_2 = 0

and

    Σ_δ = E[δ δ^T] = E[(X̄ − Ȳ)(X̄ − Ȳ)^T] = Σ_X̄ + Σ_Ȳ.

Since the values of the two samples are independent, the covariance matrices Σ_X̄ and Σ_Ȳ can be evaluated as

    Σ_X̄ = Σ_1 / n   and   Σ_Ȳ = Σ_2 / m.

Hence

    Σ_δ = Σ_1 / n + Σ_2 / m.

Thus, for n and m big enough,

    δ ∼ N(0, Σ_1 / n + Σ_2 / m),

and we define our Vector-Variate T statistical metric as

    Z := Σ_δ^{−1/2} δ = (Σ_1 / n + Σ_2 / m)^{−1/2} δ ∼ N(0, I),

where I is the identity matrix. The covariance matrices Σ_1 and Σ_2 are square (p × p) and positive definite, so Σ_δ is square positive definite and therefore orthogonally diagonalizable. Let Λ := diag(√λ_1, √λ_2, ..., √λ_p) and θ := (ϑ_1, ϑ_2, ..., ϑ_p) be the matrices formed by the square roots of the eigenvalues of Σ_δ and by the corresponding eigenvectors, respectively. Then Σ_δ can be written as

    Σ_δ = (θ^{−1} Λ θ)²,

and our vector-variate metric Z can be rewritten as

    Z := (θ^{−1} Λ^{−1} θ) δ.

2.1 The critical value z_α

The chosen critical value fixes a trade-off between the risk of rejecting H_0 when H_0 actually holds and the risk of accepting H_0 when H_1 holds. For a significance level α, we reject H_0 in favor of H_1 if

    ‖Z‖² = z_1² + z_2² + ... + z_p² > z_α²,

where α = P(z_1² + z_2² + ... + z_p² > z_α²). If p = 2 we can take z_α = √(−2 ln α). For higher dimensions p > 2, writing the probability in spherical coordinates gives

    α = ∫_0^{2π} dϕ ∫_0^π sin θ_1 dθ_1 ⋯ ∫_0^π sin^{p−2} θ_{p−2} dθ_{p−2} ∫_{z_α}^∞ (r^{p−1} / (2π)^{p/2}) e^{−r²/2} dr
      = (1 / (2^{p/2−1} Γ(p/2))) ∫_{z_α}^∞ r^{p−1} e^{−r²/2} dr.

Let Φ(x) be the cumulative distribution function of the standard Gaussian variable. We define the sequence

    I(n, x) = ∫_x^∞ r^n e^{−r²/2} dr,

with the initial values

    I(0, x) = √(2π) (1 − Φ(x))   and   I(1, x) = e^{−x²/2}.

Using integration by parts, we derive the following recurrence formula for n > 1:

    I(n, x) = x^{n−1} e^{−x²/2} + (n − 1) I(n − 2, x),

so that I(p − 1, z_α) reduces to I(1, ·) when p is even and to I(0, ·) when p is odd, and z_α is obtained by solving α = I(p − 1, z_α) / (2^{p/2−1} Γ(p/2)) numerically.
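Evaluating z_α with this recurrence is cheap. The sketch below is ours, not the authors' code; it assumes the reading that the test rejects when ‖Z‖ > z_α, so that α = I(p−1, z_α) / (2^{p/2−1} Γ(p/2)) (for p = 2 this gives the closed form z_α = √(−2 ln α)), and it finds z_α by bisection since the tail probability is decreasing in z:

```python
import math
from statistics import NormalDist

def I(n, x):
    """I(n, x) = integral of r^n e^{-r^2/2} over [x, inf), via the recurrence."""
    if n == 0:
        return math.sqrt(2 * math.pi) * (1 - NormalDist().cdf(x))
    if n == 1:
        return math.exp(-x * x / 2)
    # Integration by parts: I(n, x) = x^{n-1} e^{-x^2/2} + (n-1) I(n-2, x)
    return x ** (n - 1) * math.exp(-x * x / 2) + (n - 1) * I(n - 2, x)

def tail_prob(z, p):
    """P(||Z|| > z) for Z ~ N(0, I_p): I(p-1, z) / (2^{p/2-1} Gamma(p/2))."""
    return I(p - 1, z) / (2 ** (p / 2 - 1) * math.gamma(p / 2))

def critical_value(alpha, p, lo=0.0, hi=100.0, tol=1e-10):
    """Solve tail_prob(z_alpha, p) = alpha by bisection (tail_prob decreases in z)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tail_prob(mid, p) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As a sanity check, for p = 1 the formula reduces to tail_prob(z, 1) = 2(1 − Φ(z)), the two-sided standard normal tail.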

2.2 Parameter Estimation

The covariance matrix needs to be estimated if it is unknown. The unbiased estimator is the sample covariance matrix

    Σ̂_δ = Σ̂_1 / n + Σ̂_2 / m = ∑_{i=1}^n (x_i − X̄)(x_i − X̄)^T / (n(n − 1)) + ∑_{i=1}^m (y_i − Ȳ)(y_i − Ȳ)^T / (m(m − 1)).

This estimate must be checked and corrected if it has a negative eigenvalue. In practice the negative eigenvalues are generally very close to zero, so it is enough to replace all of them by a small positive number ε, chosen at least 10 times smaller than the smallest positive eigenvalue of the covariance matrix. For example, if the first eigenvalue is negative, the corrected covariance estimate equals

    Σ̂_δ = θ^{−1} diag(ε, λ_2, ..., λ_p) θ,

where θ is the matrix of eigenvectors of the estimated covariance matrix and ε < min_{k ∈ {2,...,p}} λ_k / 10.

3 Simulation

3.1 Feature selection

For simulation purposes the public JSRT database ([10]) was used. This standard digital image database of chest radiographs with and without lung nodules was created by the Japanese Society of Radiological Technology (JSRT) in cooperation with the Japanese Radiological Society (JRS) in 1998. The database contains 154 nodule and 93 non-nodule images with a resolution of 2048 × 2048 pixels. Symlet wavelets at decomposition level 16 were applied to all images in order to extract features. Since wavelet-based feature extraction yields a large number of features, a multi-feature selection step is applied to reduce it. The objective is to choose the most significant set of features for classification, i.e. the combination of features that best distinguishes the object (pathology) class from the non-object (normal) class. To accomplish this, a two-step feature selection process is performed.
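The estimation-and-correction step of subsection 2.2 can be sketched as follows. This is our illustration and the function names are ours; negative eigenvalues of the estimate are replaced by ε, one tenth of the smallest positive eigenvalue:

```python
import numpy as np

def estimate_sigma_delta(X, Y):
    """Unbiased estimate of Cov(delta) = Sigma_1/n + Sigma_2/m from the samples."""
    n, m = X.shape[0], Y.shape[0]
    return np.cov(X, rowvar=False) / n + np.cov(Y, rowvar=False) / m

def fix_negative_eigenvalues(S, factor=10.0):
    """Replace any non-positive eigenvalues of the symmetric matrix S by a small
    positive epsilon (`factor` times smaller than the smallest positive
    eigenvalue), then rebuild S from the corrected spectrum.
    Assumes at least one positive eigenvalue, as in the paper's setting."""
    lam, V = np.linalg.eigh(S)          # S = V diag(lam) V^T, V orthogonal
    eps = lam[lam > 0].min() / factor
    lam = np.where(lam <= 0, eps, lam)
    return V @ np.diag(lam) @ V.T
```

The corrected matrix stays symmetric and becomes positive definite, so its inverse square root, needed for Z, is well defined.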
The proposed statistical metric was used to select the best features for classifying image regions into the two classes. The results of our simulation are presented in Table 1.
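The selection loop itself is not spelled out in the paper. The following sketch is entirely ours, under the assumption that candidate feature subsets are scored by ‖Z‖ computed on those coordinates (with the eigenvalue correction of subsection 2.2 applied inline); the names `score_subset` and `rank_subsets` are hypothetical:

```python
import numpy as np

def score_subset(X, Y, idx):
    """||Z|| for the feature subset idx: larger values mean the class means are
    further apart in these coordinates, i.e. less overlap between classes."""
    Xs, Ys = X[:, idx], Y[:, idx]
    n, m = Xs.shape[0], Ys.shape[0]
    delta = Xs.mean(axis=0) - Ys.mean(axis=0)
    S = np.cov(Xs, rowvar=False) / n + np.cov(Ys, rowvar=False) / m
    S = np.atleast_2d(S)                     # np.cov squeezes 1-feature subsets
    lam, V = np.linalg.eigh(S)
    lam = np.where(lam <= 0, lam[lam > 0].min() / 10, lam)  # eigenvalue fix
    z = V @ np.diag(lam ** -0.5) @ V.T @ delta  # Z = Sigma_delta^{-1/2} delta
    return float(np.linalg.norm(z))

def rank_subsets(X, Y, subsets):
    """Order candidate feature subsets from best to worst by the metric."""
    return sorted(subsets, key=lambda idx: score_subset(X, Y, list(idx)),
                  reverse=True)
```

Subsets whose coordinates separate the class means well receive large scores and are ranked first.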

Table 1: Dependence of the accuracy on the number of features

    Number of features   Accuracy   False Negative   False Positive
    1                    0.8318     0.2424           0.1351
    21                   0.9159     0.1515           0.0541
    41                   0.9533     0.0303           0.0541
    61                   0.972      0.0303           0.027
    81                   0.972      0                0.0405
    101                  0.972      0                0.0405
    121                  0.9907     0                0.0135
    141                  0.9813     0.0303           0.0135
    161                  0.9907     0.0303           0
    181                  0.9907     0.0303           0

4 Conclusion

We have proposed a simple two-sample test that works in all cases, without any limitation on p and even when Σ_1 ≠ Σ_2. The test was conceived particularly for multivariate feature selection problems such as medical image analysis. Owing to the complexity of such problems, each feature does not work individually but tends to work together with other features to achieve certain tasks. Figure 1 illustrates the inefficiency of a one-dimensional metric and the efficiency of a two-dimensional metric. In the context of multivariate feature selection, this test estimates the overlapping area between classes. More precisely, the metric quantifies the contribution of each set of q features selected among the p available features. The method proposed in subsection 2.1 for calculating the critical value z_α requires only a reasonable computing time. According to the value of the metric, the C(p, q) possible sets of q features can be ordered by their relevance for the classifier. One way to order the groups of features from the "best group" to the "worst group" is to compare the normalized value of the metric, ‖Z‖ / ‖ν‖², where ν = X̄ + Ȳ.

Acknowledgements. This work has been supported by the Russian Ministry of Education and Science through the project "Development of new perspective methods and algorithms for automatic recognition of thorax pathologies based on X-ray images" (agreement: 14.606.21.0002, ID: RFMEF160614X0002) and was carried out at Innopolis University.

Figure 1: Projection onto the x-axis or the y-axis shows a significant overlapping area between the two classes (+ and *); according to one-dimensional metrics, the two features in this example are therefore not significant for classification. A two-dimensional metric, on the other hand, is able to quantify the real overlapping area, i.e. the two features are in fact significant for classification.

References

[1] Z. Bai, H. Saranadasa, Effect of high dimension: by an example of a two sample problem, Statistica Sinica, 6 (1996), 311-329.

[2] L. Baringhaus, C. Franz, On a new multivariate two-sample test, Journal of Multivariate Analysis, 88 (2004), 190-206. http://dx.doi.org/10.1016/s0047-259x(03)00079-4

[3] S. X. Chen, Y. Qin, A two-sample test for high-dimensional data with applications to gene-set testing, The Annals of Statistics, 38 (2010), no. 2, 808-835. http://dx.doi.org/10.1214/09-aos716

[4] J. Fan, P. Hall and Q. Yao, To how many simultaneous hypothesis tests can normal, Student's t or bootstrap calibration be applied?, J. Amer. Statist. Assoc., 102 (2007), 1282-1288. http://dx.doi.org/10.1198/016214507000000969

[5] M. Han, X. Liu, Feature selection techniques with class separability for multivariate time series, Neurocomputing, 110 (2013), 29-34. http://dx.doi.org/10.1016/j.neucom.2012.12.006

[6] H. Hotelling, The generalization of Student's ratio, Annals of Mathematical Statistics, 2 (1931), no. 3, 360-378. http://dx.doi.org/10.1214/aoms/1177732979

[7] M. R. Kosorok, S. Ma, Marginal asymptotics for the "large p, small n" paradigm: With applications to microarray data, Ann. Statist., 35 (2007), 1456-1486. http://dx.doi.org/10.1214/009053606000001433

[8] Y. Lu, P. Y. Liu, P. Xiao, and H. W. Deng, Hotelling's T2 multivariate profiling for detecting differential expression in microarrays, Bioinformatics, 21 (2005), no. 14, 3105-3113. http://dx.doi.org/10.1093/bioinformatics/bti496

[9] M. F. Schilling, Multivariate Two-Sample Tests Based on Nearest Neighbors, Journal of the American Statistical Association, 81 (1986), no. 395, 799-806. http://dx.doi.org/10.2307/2289012

[10] J. Shiraishi, S. Katsuragawa, J. Ikezoe, T. Matsumoto, T. Kobayashi, K. Komatsu, M. Matsui, H. Fujita, Y. Kodera and K. Doi, Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists' detection of pulmonary nodules, American Journal of Roentgenology, 174 (2000), no. 1, 71-74. http://dx.doi.org/10.2214/ajr.174.1.1740071

[11] M. S. Srivastava, S. Katayama, Y. Kano, A two sample test in high dimensional data, Journal of Multivariate Analysis, 114 (2013), 349-358. http://dx.doi.org/10.1016/j.jmva.2012.08.014

[12] M. Van der Laan, J. Bryan, Gene expression analysis with the parametric bootstrap, Biostatistics, 2 (2001), 445-461. http://dx.doi.org/10.1093/biostatistics/2.4.445

[13] L. Xu, Matrix-Variate Discriminative Analysis, Integrative Hypothesis Testing, and Geno-Pheno A5 Analyzer, Lecture Notes in Computer Science, 7751 (2013), 866-875. http://dx.doi.org/10.1007/978-3-642-36669-7_105

[14] A. N. Zakirov, R. F. Kuleev, A. S. Timoshenko and A. V. Vladimirov, Advanced Approaches to Computer-Aided Detection of Thoracic Diseases on Chest X-Rays, Applied Mathematical Sciences, 9 (2015), no. 88, 4361-4369. http://dx.doi.org/10.12988/ams.2015.54348

[15] N. Zhou, L. Wang, A Modified T-test Feature Selection Method and Its Application on the HapMap Genotype Data, Genomics, Proteomics and Bioinformatics, 5 (2007), no. 3-4, 242-249. http://dx.doi.org/10.1016/s1672-0229(08)60011-x

Received: November 15, 2015; Published: December 17, 2015