Information-based Feature Selection


Farzan Farnia, Abbas Kazerouni, Afshin Babveyh
Email: {farnia,abbask,afshinb}@stanford.edu

1 Introduction

Feature selection is a topic of great interest in applications dealing with high-dimensional datasets, including gene expression array analysis, combinatorial chemistry, and text processing of online documents. Feature selection brings several advantages. First, it lowers computational cost and time: less memory is needed to store the data and less processing power is required. It also improves predictor performance by avoiding overfitting and can reveal underlying connections in the data. Perhaps most importantly, it can break through the barrier of high dimensionality.

To select the most relevant subset of features, we need a mathematical tool that measures dependence among random variables. In this work, we use mutual information, a well-known dependence measure in information theory. For any pair of discrete random variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, the mutual information is defined as
$$I(X;Y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_X(x)\, p_Y(y)}. \tag{1}$$

The paper is organized as follows. In Section 2, the Maximum-Relevance Minimum-Redundancy (MRMR) method is presented along with the Maximum Joint Relevance (MJR) method. In Section 3, we present our method for solving the feature selection problem. Section 4 presents the results of our algorithm on the Madelon dataset. Finally, Section 5 concludes.

2 Mutual Information as a tool for Feature Selection

As discussed earlier, mutual information is a powerful tool for measuring relevance among random variables, and hence a useful mathematical tool for finding and selecting relevant features. In other words, if our goal is to select no more than $k$ features, an optimal strategy is to solve
$$\arg\max_{|S| = k} I(X_S; Y), \tag{2}$$
where $X_S = \{X_i : i \in S\}$. However, as $k$ grows, our estimate of this mutual information becomes less accurate, because for large $k$ we do not have enough samples to estimate it reliably. Hence, the objective function in (2) should be modified so that it can be estimated from the available samples. In the next sections, we first discuss a previous approach to this issue and then propose a new solution that improves upon it.
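For readers who want to experiment, a minimal plug-in estimate of (1) from paired discrete samples can be sketched as follows (Python; the paper itself provides no code, so the function name and structure are purely illustrative).

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(x)
    p_xy = Counter(zip(x, y))   # empirical joint counts
    p_x = Counter(x)            # empirical marginal counts
    p_y = Counter(y)
    mi = 0.0
    for (a, b), c in p_xy.items():
        p_joint = c / n
        # p_joint / (p_x * p_y) with marginals expressed as counts/n
        mi += p_joint * np.log(p_joint * n * n / (p_x[a] * p_y[b]))
    return mi

# Example: Y is an exact copy of X, so I(X;Y) = H(X) = log 2
x = np.array([0, 1, 0, 1, 0, 1])
print(mutual_information(x, x))  # ~0.693
```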

2.1 Max-Relevance Min-Redundancy (MRMR) approach

As mentioned earlier, we aim to identify the most relevant subset of features whose size is limited to a given budget. Note that this is not the same as selecting the $k$ features with the largest individual mutual information with the target $Y$: different features may share redundant information about the target. Thus, redundancy is another important factor to consider in feature selection. To balance the trade-off between relevance and redundancy, the following modified objective function (MRMR) was suggested in [2]:
$$\Phi(X_S, Y) = \frac{1}{|S|} \sum_{i \in S} I(X_i; Y) - \frac{1}{|S|^2} \sum_{i,j \in S} I(X_i; X_j). \tag{3}$$
Here, the first term measures the average relevance of the features to the target, while the second term measures the average pairwise redundancy among the selected features. Therefore, maximizing $\Phi(X_S, Y)$ identifies a well-characterizing feature subset whose total information about the target is close to that of the optimal feature subset. To maximize this objective, [2] uses an inductive approach: the most informative feature is chosen first, and subsequent features are added by solving, at every step,
$$\arg\max_{X_j \in X \setminus S_m} \; I(X_j; Y) - \frac{1}{m-1} \sum_{X_i \in S_m} I(X_j; X_i), \tag{4}$$
where $S_m$ denotes the set of features selected so far.

2.2 Maximum Joint Relevance

Although MRMR is a well-known feature selection method, there are several applications in which its test error rate never drops below fairly large thresholds such as 34%, which is quite unsatisfactory. Note that (3) includes only pairwise interactions. By considering higher-order interactions, we can select a more informative feature subset, which in turn yields smaller error rates. To this end, the Maximum Joint Relevance (MJR) algorithm changes the inductive rule (4) to a more sensitive one [3]:
$$\arg\max_{X_j \in X \setminus S_m} \sum_{X_i \in S_m} I(X_j, X_i; Y). \tag{5}$$
Nevertheless, we may again face a shortage of samples for estimating the second-order mutual information appearing in this formulation. In fact, a considerable number of the third-order empirical marginals may become very small, which calls for a more accurate estimate of mutual information than the plug-in empirical one. Therefore, in the next section we propose a new algorithm that estimates mutual information with higher accuracy. As an important advantage, this estimation technique reduces the sample size required to estimate mutual information to within the same accuracy.

3 Adaptive Maximum Joint Relevance

In this section, we propose the Adaptive Maximum Joint Relevance (AMJR) feature selection algorithm to tackle the instability problem in MJR. Like MJR, we use the criterion in (5) to iteratively select the most relevant features, but we propose a new scheme for estimating the mutual information terms that stabilizes the algorithm in small training-set regimes. We build our estimation technique on the functional estimation method proposed in [4].
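A rough sketch of the greedy forward selection implied by (4)-(5) is given below (Python, reusing the `mutual_information` helper sketched in Section 1; seeding with the single most informative feature follows the MRMR description, and treating the pair $(X_j, X_i)$ as one joint discrete variable is our reading of $I(X_j, X_i; Y)$, not the authors' code).

```python
def select_features_mjr(X, y, k):
    """Greedy MJR-style forward selection (sketch).

    X: (n_samples, n_features) array-like of discrete features.
    y: (n_samples,) array-like of discrete labels.
    Returns the indices of the k selected features.
    Assumes the `mutual_information` helper from the earlier sketch.
    """
    n_features = X.shape[1]
    # Seed with the single most informative feature, as in MRMR/MJR.
    selected = [max(range(n_features),
                    key=lambda j: mutual_information(X[:, j], y))]
    while len(selected) < k:
        def joint_score(j):
            # Sum of second-order terms I(X_j, X_i; Y) over already-selected X_i,
            # where (X_j, X_i) is treated as one joint discrete variable.
            return sum(mutual_information(list(zip(X[:, j], X[:, i])), y)
                       for i in selected)
        remaining = [j for j in range(n_features) if j not in selected]
        selected.append(max(remaining, key=joint_score))
    return selected
```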

Specifically, in order to estimate $I(X_j, X_i; Y)$ at each step, we estimate the joint entropies according to the identity
$$I(X_j, X_i; Y) = H(X_j, X_i) + H(Y) - H(X_j, X_i, Y). \tag{6}$$

To describe the estimation method in AMJR, consider, for example, estimating $H(X_j, X_i)$. Following [4], the empirical joint distribution of $(X_j, X_i)$ is first computed as
$$\hat{P}_{a,b} = \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{(X_j, X_i)^{(t)} = (a, b)\}, \tag{7}$$
where $n$ is the size of the training set and $(X_j, X_i)^{(t)}$ is the joint value in the $t$-th training example. Note that $a$ and $b$ are assumed to take values in some finite sets $\mathcal{A}$ and $\mathcal{B}$, respectively. Now, denoting by $P_{a,b}$ the true joint probability of $(X_j, X_i)$ at the point $(a, b)$, the true joint entropy is
$$H(X_j, X_i) = -\sum_{a \in \mathcal{A},\, b \in \mathcal{B}} P_{a,b} \log P_{a,b}. \tag{8}$$

To obtain an estimator $\hat{H}(X_j, X_i)$ of $H(X_j, X_i)$, one naive way is to substitute each $P_{a,b}$ in (8) with its estimate $\hat{P}_{a,b}$. This method, which is used in MJR, is in fact the source of the instability in performance, since most of the estimated probabilities are very small. In AMJR, we instead consider two cases for the estimated joint probabilities:

If $\hat{P}_{a,b} \ge \frac{\log n}{n}$, we use it as an estimate of $P_{a,b}$ in (8).

If $\hat{P}_{a,b} < \frac{\log n}{n}$, we first fit a polynomial $f$ of order $\log n$ to the function $x \log x$ on the interval $\left(0, \frac{\log n}{n}\right)$. Then, we use $f(\hat{P}_{a,b})$ as an estimate of $P_{a,b} \log P_{a,b}$ in (8).

As we will see in Section 4, the approximating polynomial $f$ stabilizes the algorithm and improves its performance. Consequently, the AMJR estimate of $H(X_j, X_i)$ is
$$\hat{H}(X_j, X_i) = -\left( \sum_{\hat{P}_{a,b} \ge \frac{\log n}{n}} \hat{P}_{a,b} \log \hat{P}_{a,b} \;+\; \sum_{\hat{P}_{a,b} < \frac{\log n}{n}} f(\hat{P}_{a,b}) \right). \tag{9}$$
The estimates $\hat{H}(X_j, X_i, Y)$ and $\hat{H}(Y)$ of $H(X_j, X_i, Y)$ and $H(Y)$ are obtained in the same way. Finally, the mutual information is estimated as
$$\hat{I}(X_j, X_i; Y) = \hat{H}(X_j, X_i) + \hat{H}(Y) - \hat{H}(X_j, X_i, Y). \tag{10}$$
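The two-case estimator behind (9)-(10) could be sketched roughly as follows (Python/NumPy; the threshold $\log n / n$, the polynomial degree, and all names reflect our reading of the description above rather than the authors' actual code, and the correction is shown for a generic discrete sample).

```python
import numpy as np
from collections import Counter

def amjr_entropy(samples):
    """Entropy estimate (in nats) of a discrete sample, replacing the plug-in
    term p*log(p) by a polynomial approximation when p is below log(n)/n."""
    n = len(samples)
    p_hat = np.array([c / n for c in Counter(samples).values()])
    threshold = np.log(n) / n

    # Fit a polynomial of degree ~log(n) to x*log(x) on (0, threshold).
    degree = max(1, int(np.log(n)))
    grid = np.linspace(threshold / 1000, threshold, 200)
    f = np.polynomial.Polynomial.fit(grid, grid * np.log(grid), degree)

    small = p_hat < threshold
    plug_in = p_hat[~small] * np.log(p_hat[~small])   # reliable cells, eq. (8)
    corrected = f(p_hat[small])                        # small-probability cells
    return -(plug_in.sum() + corrected.sum())          # eq. (9)

def amjr_mutual_information(x, y):
    """I(X,Y-style pair; label) via H(X) + H(Y) - H(X,Y), mirroring (6)/(10)."""
    joint = list(zip(x, y))
    return amjr_entropy(list(x)) + amjr_entropy(list(y)) - amjr_entropy(joint)
```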

4 Numerical Results

In this section we provide numerical results to support our analysis. We run different feature selection and classification methods on the Madelon dataset released in the NIPS 2003 feature selection challenge [5]. This dataset consists of 2000 samples, each containing 500 continuous input features and one binary output response. We use 1400 samples (70%) as the training set and the remaining 600 samples (30%) as the test set.

To explore the effect of sample size on the different feature selection methods, we quantize the input space uniformly into 3 and 5 levels, giving two scenarios. In the first scenario, the input features are separately quantized into three levels, which corresponds to a large training-set regime (each level occurs many times, and there is a small number of probabilities to estimate). In the second scenario, the input features are separately quantized into 5 levels; this corresponds to a small training-set regime, where there is a large number of probabilities to estimate.

Figure 1 compares the misclassification error of the MRMR and MJR feature selection algorithms for different numbers of selected features. Here, an SVM is used as the classifier and the input space is quantized into 3 levels. Since this scenario corresponds to the large training-set regime, MJR outperforms MRMR, as shown in the figure.

Figure 1: SVM classification error for 3-level quantization of the input space (MRMR vs. MJR).

In Fig. 2, the SVM misclassification errors of MJR and AMJR are compared for different numbers of selected features. Here, the input space is quantized into 5 levels, which corresponds to the small training-set scenario. As the figure shows, MJR behaves unstably in this scenario, while AMJR is stable and performs better. This confirms our analysis of the instability of MJR and shows that the proposed method (AMJR) removes the instability problem almost completely.

Figure 2: SVM classification error for 5-level quantization of the input space (MJR vs. AMJR).
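For completeness, the evaluation protocol behind these comparisons (uniform quantization, a 70/30 split, and SVM test error on the selected features) might be sketched as follows (Python with scikit-learn; all names are illustrative and the split and classifier settings are assumptions, not the authors' exact setup).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def quantize_uniform(X, levels):
    """Quantize each continuous feature separately into `levels` uniform bins."""
    Xq = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), levels + 1)[1:-1]
        Xq[:, j] = np.digitize(X[:, j], edges)
    return Xq

def test_error(X, y, selected, levels):
    """Train an SVM on the quantized, selected features and return test error.

    `selected` holds indices chosen by a feature selection method
    (e.g., the selection sketches above)."""
    Xq = quantize_uniform(X, levels)[:, selected]
    X_tr, X_te, y_tr, y_te = train_test_split(Xq, y, test_size=0.3, random_state=0)
    clf = SVC().fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)
```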

The advantage of the proposed AMJR method is further illustrated in Fig. 3, which compares the SVM misclassification errors of AMJR and MRMR for different numbers of selected features. Here, the input space is again quantized into 5 levels (the small training-set regime). As the figure shows, AMJR substantially outperforms MRMR for any number of selected features.

Figure 3: SVM classification error for 5-level quantization of the input space (AMJR vs. MRMR).

It is worth mentioning that, in addition to SVM, we repeated the above experiments with logistic regression and classification trees and obtained the same relative results. Since our focus is on comparing the feature selection algorithms (and not the classification methods), and due to lack of space, the results for these classifiers are not reported here.

5 Conclusion

Feature selection is an indispensable part of the solution when dealing with high-dimensional datasets, and mutual information is a powerful tool for addressing it. A common approach is the Maximum-Relevance Minimum-Redundancy (MRMR) method. In this paper, based on insights from information theory, a new objective function is used. In addition, a novel mutual information estimator is employed, enabling us to discretize the data into finer levels. Combining the novel mutual information estimator with the new objective function, an error rate three times lower than that of MRMR is demonstrated.

References

[1] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[2] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.

[3] H. Yang and J. Moody. Data visualization and feature selection: New algorithms for nongaussian data. NIPS, 1999.

[4] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. arXiv preprint, 2014.

[5] Available online: http://www.nipsfsc.ecs.soton.ac.uk/datasets