Feature Selection by Reordering *

Marcel Jirina 1 and Marcel Jirina jr. 2

1 Institute of Computer Science, Pod vodarenskou vezi 2, 182 07 Prague 8 - Liben, Czech Republic, marcel@cs.cas.cz
2 Center of Applied Cybernetics, FEE, CTU Prague, Technicka 2, 166 27 Prague 6 - Dejvice, Czech Republic, jirina@fel.cvut.cz

Abstract. Feature selection serves both to reduce the total amount of available data (removing valueless data) and to improve the overall behavior of a given induction algorithm (removing data that deteriorate the results). A method of proper selection of features for an inductive algorithm is discussed. The main idea consists in a proper descending ordering of the features according to a measure of the new information each feature contributes to the previously selected set of valuable features. The measure is based on comparing the statistical distributions of the individual features, including their mutual correlation. A mathematical description of the approach is given, and results of the method applied to real-life data are shown.

1 Introduction

Feature selection is a process of data preparation for subsequent processing. Simply said, feature selection filters out unnecessary variables. There are two motivations for feature selection. The first is the time required for processing a large amount of data during learning [6]. The other is the finding that the results of the induction algorithm (classification, recognition, or also approximation and prediction) may be worse due to the presence of unnecessary features [2].

There exist two essentially different views, the so-called filter model and the wrapper model [5]. In the filter model, features are selected independently of the induction algorithm. Wrapper models (methods) are tightly bound to an induction algorithm. Another, more quantitative approach states that each feature has some "weight" for its use by the induction algorithm. There are many approaches that try to define and evaluate feature weights, usually without any relation to the induction algorithm, e.g. [2], [6], [7].

* Final version published: Jirina Marcel, Jirina jr. M.: Feature Selection by Reordering. In: SOFSEM 2005: Theory and Practice of Computer Science (Ed.: Vojtáš P., Bieliková M., Charron-Bost B., Sýkora O.), Springer-Verlag, Berlin 2005, pp. 385-389 (ISBN 3-540-24302-X) (ISSN 0302-9743) (Lecture Notes in Computer Science, vol. 3381). Held at SOFSEM 2005, 31st Conference on Current Trends in Theory and Practice of Computer Science, Liptovský Ján, SK, January 22-28, 2005. The work was supported by the Ministry of Education of the Czech Republic under project LN00B096.

2 Problem Formulation

The suggested method is based on selecting a relevant (appropriate) feature set from a given set. This can be achieved without the need for a metric on feature sets; in fact, a proper ordering of features or feature sets is sufficient. The measure used for the ordering need not be a metric in the pure sense. It should merely provide a tool for evaluating how much a particular feature brings new information to the set of features already selected.

3 The Method

The suggested method considers features for a classification task into two classes (0 and 1). The method for stating the measure of feature weight uses comparisons of the statistical distributions of the individual features, for each feature separately for each class. The comparison of distributions is derived from testing the hypothesis that two probability distributions come from the same source. The higher the probability that these distributions are different, the greater the influence of the particular feature (variable) on proper classification. In fact, we do not evaluate the correlation probability between a pair of features over all data, but only between the subsets corresponding to the same class. After these probabilities are computed, the ordering of features is possible. The first feature should bring maximal information for good classification, the second one a little less, including ("subtracting") also its correlation with the first, the third again a little less, including its correlations with the two preceding features, and so on.

Standard hypothesis testing is based on the following considerations. Given some hypothesis, e.g. that two distributions are the same or that two variables are correlated, a test statistic V is defined for this hypothesis. Next, a probability p is computed from the value of V, often using some other information or assumptions. Then some level (threshold) P is chosen. If p > P the hypothesis is accepted, otherwise it is rejected. Sometimes 1 - p is used instead of p, and thus 1 - P instead of P, and the test must be modified accordingly. The logic used in this paper is based on something "dual" to the considerations above. Let q = 1 - p be a probability (we call it the probability level of rejection of the hypothesis) and Q = 1 - P be a level. If q < Q the hypothesis is accepted, otherwise it is rejected. The larger q, the more likely the hypothesis is rejected (for the same level Q or P). In fact, the weights assigned to the individual features are the probability levels q related to the rejection of the hypotheses that distributions are the same or that variables are correlated.

Let F_i be a feature. For the first ordering of the individual features F_1, F_2, ... according to their influence on proper classification we use the probability level of rejection p_ii of the hypothesis that the probability distributions of feature F_i for class 0 and for class 1 are the same. The p_ii is based on a suitable test, e.g. the Kolmogorov-Smirnov test [8] or the Cramér-von Mises test [1]. No correlation of features is considered for this first ordering. To include the influence of correlations, let p_ij,0 and p_ij,1 be the probability levels of rejection of the hypotheses that the distributions of the variables are correlated for class 0 and for class 1, respectively.
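The rejection levels just defined can be computed with standard statistical tests. Below is a minimal Python sketch, assuming scipy's two-sample Kolmogorov-Smirnov test for the per-feature comparison and the usual significance test of the correlation coefficient for the pairwise levels; the paper cites [8], [1] and [3] for such tests but does not prescribe an implementation, and the names build_rejection_matrix, X and y are illustrative. The returned array corresponds to the matrix M introduced in the next paragraph.

```python
# Sketch only: the concrete test choices and names are assumptions,
# not the paper's prescribed implementation.
import numpy as np
from scipy.stats import ks_2samp, pearsonr

def build_rejection_matrix(X, y):
    """Build the n x n matrix of probability levels of rejection.

    X : (samples, n) feature matrix, y : binary class labels (0/1).
    Diagonal (i, i): q = 1 - p of the two-sample KS test comparing the
    class-0 and class-1 samples of feature i (large = informative feature).
    Upper triangle (i, j), i < j: pairwise level for class 0; lower
    triangle: the same for class 1. The correlation-test p-value is used
    as a stand-in for the paper's "rejection level of the correlation
    hypothesis": it is small when the two features are strongly correlated
    within a class, which is the property formula (1) below relies on.
    """
    X0, X1 = X[y == 0], X[y == 1]
    n = X.shape[1]
    M = np.ones((n, n))
    for i in range(n):
        M[i, i] = 1.0 - ks_2samp(X0[:, i], X1[:, i]).pvalue
        for j in range(i + 1, n):
            M[i, j] = pearsonr(X0[:, i], X0[:, j])[1]  # class 0 pairwise level
            M[j, i] = pearsonr(X1[:, i], X1[:, j])[1]  # class 1 pairwise level
    return M
```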

Taking all probability levels together, we have two triangular matrices, one for p_ij,0 and another for p_ij,1, i, j = 1, 2, ..., n. All results of the pairwise distribution comparisons and correlations can therefore be written in a single square n x n matrix

M = \begin{pmatrix} p_{11} & p_{12,0} & \cdots & p_{1n,0} \\ p_{21,1} & p_{22} & \cdots & p_{2n,0} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1,1} & p_{n2,1} & \cdots & p_{nn} \end{pmatrix}.

In this matrix the diagonal entries are the probability levels of rejection of the hypothesis that, for the feature with the given index, the distributions for class 0 and class 1 are the same. The upper triangular part contains the probability levels p_ij,0 for class 0, and the lower triangular part the probability levels p_ij,1 for class 1.

In the beginning the ordering of features is arbitrary. We now sort the rows and columns in descending order according to the diagonal elements p_ii of the matrix M and reassign the indexes according to this ordering. The first feature is now the feature with the largest difference between the distributions of the two classes. The second feature has a smaller difference between the two class distributions and may possibly be somehow correlated with the preceding feature, and so on. Then we compute, first, the correlation coefficient of features 1 and 2 for class 0 and, second, the correlation coefficient of features 1 and 2 for class 1, obtaining the probability levels p_12,0 and p_21,1 of rejection of the hypothesis that the features are correlated. The smaller these probabilities, the stronger the correlation between features F_1 and F_2.

Let us define the independence level of feature F_i with respect to the preceding features F_k, k < i:

\pi_i = p_{ii} \prod_{k=1}^{i-1} p_{ki,0} \, p_{ik,1}.   (1)

According to this formula, the probability level of rejection of the hypothesis that the distributions for class 0 and class 1 are the same is modified by the measure of dependence on the preceding variables. For the calculation of the probability levels p_ki,0 and p_ik,1 we use a standard approach [3] based on the fact that for a small number of samples a nonzero correlation coefficient is found. We associate these probability levels with the corresponding rows and columns and again sort the rows and columns according to π_i in descending order. After that we again compute π_i according to (1) using the new ordering and the new indexing of the variables. This step is repeated until no change in the ordering occurs; fast convergence of this process was observed. By this procedure the features are reordered from the original arbitrary ordering to a new ordering such that the first feature has the largest π_i and the last the smallest π_i.
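The reordering loop just described can be written compactly. The following Python sketch assumes the matrix M from the earlier sketch; independence_levels and reorder_features are illustrative names, and the max_iter guard is an addition of this sketch (the paper only reports that the process converges quickly).

```python
# Sketch of the iterative reordering driven by formula (1); the names and
# the max_iter safeguard are assumptions of this sketch.
import numpy as np

def independence_levels(M):
    """pi_i = p_ii times the product over k < i of the two pairwise levels for (i, k)."""
    n = M.shape[0]
    pi = np.empty(n)
    for i in range(n):
        pi[i] = M[i, i]
        for k in range(i):
            # product of both pairwise rejection levels (class 0 and class 1)
            # for the pair (i, k), whichever triangle they currently occupy
            pi[i] *= M[k, i] * M[i, k]
    return pi

def reorder_features(M, max_iter=100):
    """Return feature indices ordered by descending pi_i, as in formula (1)."""
    order = np.argsort(-np.diag(M))            # first ordering: by p_ii only
    for _ in range(max_iter):
        Mo = M[np.ix_(order, order)]           # reindex rows and columns
        pi = independence_levels(Mo)
        new_order = order[np.argsort(-pi, kind="stable")]
        if np.array_equal(new_order, order):   # ordering is stable: stop
            break
        order = new_order
    return order
```

Calling reorder_features(build_rejection_matrix(X, y)) then returns the feature indices from the largest π_i to the smallest, which is the ordering examined in the next section.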

5 Results

The suggested method is demonstrated on the task of feature ordering for eight UCI MLR real-life databases: Heart, Vote, Splice, Spam, Shuttle, Ionosphere, German, and Adult [9]. The results after feature reordering are shown in Fig. 1.

Fig. 1. Dependence of π_i on the feature number after reordering for the eight databases (π_i plotted on a logarithmic percentage scale).

6 Conclusion

We have presented a procedure for evaluating feature weights based on the idea that we need not evaluate subsets of features or build a metric on the space of feature subsets. Instead of a metric, an ordering suffices, which is a much weaker condition. In fact, we need an ordering of the features from the viewpoint of each feature's ability to bring something new to the set of features already selected. If the features are properly ordered, we need not measure any distance; the knowledge that one feature is more important than another is sufficient. With the features already ordered, the question of proper feature set selection is reduced from combinatorial complexity to linear or at worst polynomial complexity, depending on the induction algorithm.

References

[1] Csörgö, S., Faraway, J.J.: The Exact and Asymptotic Distributions of Cramér-von Mises Statistics. J. R. Statist. Soc. B, vol. 58, no. 1 (1996) 221-234
[2] Dong, M., Kothari, R.: Feature subset selection using a new definition of classifiability. Pattern Recognition Letters 24 (2003) 1215-1225
[3] Hátle, J., Likeš, J.: Basics in Probability and Mathematical Statistics (in Czech). SNTL/ALFA, Praha (1974)
[4] Jirina, M., Jirina, M., jr.: Feature Selection by Reordering According to Their Weights. Technical Report No. 99, Institute of Computer Science AS CR, November 2004, 9 pp. http://www.cs.cas.cz/people/homepages/jirina_marcel.shtml
[5] John, G.H., Kohavi, R., Pfleger, K.: Irrelevant Features and the Subset Selection Problem. In: Machine Learning: Proc. of the Eleventh Int. Conf. (eds. Cohen, W., Hirsh, H.), Morgan Kaufmann Publishers, San Francisco, CA, USA (1994) 121-129
[6] Koller, D., Sahami, M.: Toward Optimal Feature Selection. In: Proc. of the Thirteenth Int. Conf. on Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, USA (1996) 284-292
[7] Last, M., Kandel, A., Maimon, O.: Information-theoretic algorithm for feature selection. Pattern Recognition Letters 22 (2001) 799-811

[8] Smirnov, N.: Table for estimating the goodness of fit of empirical distributions. Annals of Math. Statist. 19 (1948) 279-281
[9] UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html