Information-based Feature Selection


Farzan Farnia, Abbas Kazerouni, Afshin Babveyh
Email: {farnia,abbask,afshinb}@stanford.edu

1 Introduction

Feature selection is a topic of great interest in applications dealing with high-dimensional datasets, including gene expression array analysis, combinatorial chemistry, and text processing of online documents. Feature selection brings several advantages. First, it lowers computational cost and time: less memory is needed to store the data and less processing power is required. It also improves predictor performance by avoiding overfitting and can reveal underlying connections in the data. Perhaps most importantly, it can break through the barrier of high dimensionality.

To select the most relevant subset of features, we need a mathematical tool that measures dependence among random variables. In this work, we use mutual information, a well-known dependence measure in information theory. For any pair of discrete random variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, the mutual information is defined as
$$I(X;Y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_X(x)\, p_Y(y)}. \tag{1}$$

The paper is organized as follows. In Section 2, the Maximum-Relevance Minimum-Redundancy (MRMR) method is presented along with the Maximum Joint Relevance (MJR) method. In Section 3, we present our method for solving the feature selection problem. Section 4 presents the results of our algorithm on the Madelon dataset. Finally, Section 5 concludes.

2 Mutual Information as a tool for Feature Selection

As discussed earlier, mutual information is a powerful tool for measuring relevance among random variables, and hence a useful mathematical tool for finding and selecting relevant features. In other words, if our goal is to select no more than $k$ features, an optimal strategy is to solve
$$\arg\max_{|S| = k} I(X_S; Y), \tag{2}$$
where $X_S = \{X_i : i \in S\}$. However, as $k$ grows, our estimate of this mutual information becomes less accurate, because for large $k$ we do not have enough samples to estimate it reliably. Hence, the objective function in (2) should be modified so that it can be estimated from the available samples. In the next sections, we first discuss a previous approach to this issue and then propose a new solution that improves upon it.
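For readers who want to experiment, a minimal plug-in estimate of (1) from paired discrete samples can be sketched as follows (Python; the paper itself provides no code, so the function name and structure are purely illustrative).

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(x)
    p_xy = Counter(zip(x, y))   # empirical joint counts
    p_x = Counter(x)            # empirical marginal counts
    p_y = Counter(y)
    mi = 0.0
    for (a, b), c in p_xy.items():
        p_joint = c / n
        # p_joint / (p_x * p_y) with marginals expressed as counts/n
        mi += p_joint * np.log(p_joint * n * n / (p_x[a] * p_y[b]))
    return mi

# Example: Y is an exact copy of X, so I(X;Y) = H(X) = log 2
x = np.array([0, 1, 0, 1, 0, 1])
print(mutual_information(x, x))  # ~0.693
```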

2.1 Max-Relevance Min-Redundancy (MRMR) approach

As mentioned earlier, we aim to identify the most relevant subset of features whose size is limited to a given budget. Note that this is not the same as selecting the $k$ features with the largest individual mutual information with the target $Y$: different features may share redundant information about the target. Thus, redundancy is another important factor to consider in feature selection. To balance the trade-off between relevance and redundancy, the following modified objective function (MRMR) was suggested in [2]:
$$\Phi(X_S, Y) = \frac{1}{|S|} \sum_{i \in S} I(X_i; Y) - \frac{1}{|S|^2} \sum_{i,j \in S} I(X_i; X_j). \tag{3}$$
Here, the first term measures the average relevance of the features to the target, while the second term measures the average pairwise redundancy among the selected features. Therefore, maximizing $\Phi(X_S, Y)$ identifies a well-characterizing feature subset whose total information about the target is close to that of the optimal feature subset. To maximize this objective, [2] uses an inductive approach: the most informative feature is chosen first, and subsequent features are added by solving, at every step,
$$\arg\max_{X_j \in X \setminus S_m} \; I(X_j; Y) - \frac{1}{m-1} \sum_{X_i \in S_m} I(X_j; X_i), \tag{4}$$
where $S_m$ denotes the set of features selected so far.

2.2 Maximum Joint Relevance

Although MRMR is a well-known feature selection method, there are several applications in which its test error rate never drops below fairly large thresholds such as 34%, which is quite unsatisfactory. Note that (3) includes only pairwise interactions. By considering higher-order interactions, we can select a more informative feature subset, which in turn yields smaller error rates. To this end, the Maximum Joint Relevance (MJR) algorithm changes the inductive rule (4) to a more sensitive one [3]:
$$\arg\max_{X_j \in X \setminus S_m} \sum_{X_i \in S_m} I(X_j, X_i; Y). \tag{5}$$
Nevertheless, we may again face a shortage of samples for estimating the second-order mutual information appearing in this formulation. In fact, a considerable number of the third-order empirical marginals may become very small, which calls for a more accurate estimate of mutual information than the plug-in empirical one. Therefore, in the next section we propose a new algorithm that estimates mutual information with higher accuracy. As an important advantage, this estimation technique reduces the sample size required to estimate mutual information to within the same accuracy.

3 Adaptive Maximum Joint Relevance

In this section, we propose the Adaptive Maximum Joint Relevance (AMJR) feature selection algorithm to tackle the instability problem in MJR. Like MJR, we use the criterion in (5) to iteratively select the most relevant features, but we propose a new scheme for estimating the mutual information terms that stabilizes the algorithm in small training-set regimes. We build our estimation technique on the functional estimation method proposed in [4].
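A rough sketch of the greedy forward selection implied by (4)-(5) is given below (Python, reusing the `mutual_information` helper sketched in Section 1; seeding with the single most informative feature follows the MRMR description, and treating the pair $(X_j, X_i)$ as one joint discrete variable is our reading of $I(X_j, X_i; Y)$, not the authors' code).

```python
def select_features_mjr(X, y, k):
    """Greedy MJR-style forward selection (sketch).

    X: (n_samples, n_features) array-like of discrete features.
    y: (n_samples,) array-like of discrete labels.
    Returns the indices of the k selected features.
    Assumes the `mutual_information` helper from the earlier sketch.
    """
    n_features = X.shape[1]
    # Seed with the single most informative feature, as in MRMR/MJR.
    selected = [max(range(n_features),
                    key=lambda j: mutual_information(X[:, j], y))]
    while len(selected) < k:
        def joint_score(j):
            # Sum of second-order terms I(X_j, X_i; Y) over already-selected X_i,
            # where (X_j, X_i) is treated as one joint discrete variable.
            return sum(mutual_information(list(zip(X[:, j], X[:, i])), y)
                       for i in selected)
        remaining = [j for j in range(n_features) if j not in selected]
        selected.append(max(remaining, key=joint_score))
    return selected
```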

Specifically, in order to estimate $I(X_j, X_i; Y)$ at each step, we estimate the joint entropies according to the identity
$$I(X_j, X_i; Y) = H(X_j, X_i) + H(Y) - H(X_j, X_i, Y). \tag{6}$$

To describe the estimation method in AMJR, consider, for example, estimating $H(X_j, X_i)$. Following [4], the empirical joint distribution of $(X_j, X_i)$ is first computed as
$$\hat{P}_{a,b} = \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{(X_j, X_i)^{(t)} = (a, b)\}, \tag{7}$$
where $n$ is the size of the training set and $(X_j, X_i)^{(t)}$ is the joint value in the $t$-th training example. Note that $a$ and $b$ are assumed to take values in some finite sets $\mathcal{A}$ and $\mathcal{B}$, respectively. Now, denoting by $P_{a,b}$ the true joint probability of $(X_j, X_i)$ at the point $(a, b)$, the true joint entropy is
$$H(X_j, X_i) = -\sum_{a \in \mathcal{A},\, b \in \mathcal{B}} P_{a,b} \log P_{a,b}. \tag{8}$$

To obtain an estimator $\hat{H}(X_j, X_i)$ of $H(X_j, X_i)$, one naive way is to substitute each $P_{a,b}$ in (8) with its estimate $\hat{P}_{a,b}$. This method, which is used in MJR, is in fact the source of the instability in performance, since most of the estimated probabilities are very small. In AMJR, we instead consider two cases for the estimated joint probabilities:

If $\hat{P}_{a,b} \ge \frac{\log n}{n}$, we use it as an estimate of $P_{a,b}$ in (8).

If $\hat{P}_{a,b} < \frac{\log n}{n}$, we first fit a polynomial $f$ of order $\log n$ to the function $x \log x$ on the interval $\left(0, \frac{\log n}{n}\right)$. Then, we use $f(\hat{P}_{a,b})$ as an estimate of $P_{a,b} \log P_{a,b}$ in (8).

As we will see in Section 4, the approximating polynomial $f$ stabilizes the algorithm and improves its performance. Consequently, the AMJR estimate of $H(X_j, X_i)$ is
$$\hat{H}(X_j, X_i) = -\left( \sum_{\hat{P}_{a,b} \ge \frac{\log n}{n}} \hat{P}_{a,b} \log \hat{P}_{a,b} \;+\; \sum_{\hat{P}_{a,b} < \frac{\log n}{n}} f(\hat{P}_{a,b}) \right). \tag{9}$$
The estimates $\hat{H}(X_j, X_i, Y)$ and $\hat{H}(Y)$ of $H(X_j, X_i, Y)$ and $H(Y)$ are obtained in the same way. Finally, the mutual information is estimated as
$$\hat{I}(X_j, X_i; Y) = \hat{H}(X_j, X_i) + \hat{H}(Y) - \hat{H}(X_j, X_i, Y). \tag{10}$$
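The two-case estimator behind (9)-(10) could be sketched roughly as follows (Python/NumPy; the threshold $\log n / n$, the polynomial degree, and all names reflect our reading of the description above rather than the authors' actual code, and the correction is shown for a generic discrete sample).

```python
import numpy as np
from collections import Counter

def amjr_entropy(samples):
    """Entropy estimate (in nats) of a discrete sample, replacing the plug-in
    term p*log(p) by a polynomial approximation when p is below log(n)/n."""
    n = len(samples)
    p_hat = np.array([c / n for c in Counter(samples).values()])
    threshold = np.log(n) / n

    # Fit a polynomial of degree ~log(n) to x*log(x) on (0, threshold).
    degree = max(1, int(np.log(n)))
    grid = np.linspace(threshold / 1000, threshold, 200)
    f = np.polynomial.Polynomial.fit(grid, grid * np.log(grid), degree)

    small = p_hat < threshold
    plug_in = p_hat[~small] * np.log(p_hat[~small])   # reliable cells, eq. (8)
    corrected = f(p_hat[small])                        # small-probability cells
    return -(plug_in.sum() + corrected.sum())          # eq. (9)

def amjr_mutual_information(x, y):
    """I(X,Y-style pair; label) via H(X) + H(Y) - H(X,Y), mirroring (6)/(10)."""
    joint = list(zip(x, y))
    return amjr_entropy(list(x)) + amjr_entropy(list(y)) - amjr_entropy(joint)
```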

4 Numerical Results

In this section we provide numerical results to support our analysis. We run different feature selection and classification methods on the Madelon dataset released in the NIPS 2003 feature selection challenge [5]. This dataset consists of 2000 samples, each containing 500 continuous input features and one binary output response. We use 1400 samples (70%) as the training set and the remaining 600 samples (30%) as the test set.

To explore the effect of sample size on the different feature selection methods, we quantize the input space uniformly into 3 and 5 levels, giving two scenarios. In the first scenario, the input features are separately quantized into three levels, which corresponds to a large training-set regime (each level occurs many times, and there is a small number of probabilities to estimate). In the second scenario, the input features are separately quantized into 5 levels; this corresponds to a small training-set regime, where there is a large number of probabilities to estimate.

Figure 1 compares the misclassification error of the MRMR and MJR feature selection algorithms for different numbers of selected features. Here, an SVM is used as the classifier and the input space is quantized into 3 levels. Since this scenario corresponds to the large training-set regime, MJR outperforms MRMR, as shown in the figure.

Figure 1: SVM classification error for 3-level quantization of the input space (MRMR vs. MJR).

In Fig. 2, the SVM misclassification errors of MJR and AMJR are compared for different numbers of selected features. Here, the input space is quantized into 5 levels, which corresponds to the small training-set scenario. As the figure shows, MJR behaves unstably in this scenario, while AMJR is stable and performs better. This confirms our analysis of the instability of MJR and shows that the proposed method (AMJR) removes the instability problem almost completely.

Figure 2: SVM classification error for 5-level quantization of the input space (MJR vs. AMJR).
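For completeness, the evaluation protocol behind these comparisons (uniform quantization, a 70/30 split, and SVM test error on the selected features) might be sketched as follows (Python with scikit-learn; all names are illustrative and the split and classifier settings are assumptions, not the authors' exact setup).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def quantize_uniform(X, levels):
    """Quantize each continuous feature separately into `levels` uniform bins."""
    Xq = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), levels + 1)[1:-1]
        Xq[:, j] = np.digitize(X[:, j], edges)
    return Xq

def test_error(X, y, selected, levels):
    """Train an SVM on the quantized, selected features and return test error.

    `selected` holds indices chosen by a feature selection method
    (e.g., the selection sketches above)."""
    Xq = quantize_uniform(X, levels)[:, selected]
    X_tr, X_te, y_tr, y_te = train_test_split(Xq, y, test_size=0.3, random_state=0)
    clf = SVC().fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)
```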

The advantage of the proposed AMJR method is further illustrated in Fig. 3, which compares the SVM misclassification errors of AMJR and MRMR for different numbers of selected features. Here, the input space is again quantized into 5 levels (the small training-set regime). As the figure shows, AMJR substantially outperforms MRMR for any number of selected features.

Figure 3: SVM classification error for 5-level quantization of the input space (AMJR vs. MRMR).

It is worth mentioning that, in addition to SVM, we repeated the above experiments with logistic regression and classification trees and obtained the same relative results. Since our focus is on comparing the feature selection algorithms (and not the classification methods), and due to lack of space, the results for these classifiers are not reported here.

5 Conclusion

Feature selection is an indispensable part of the solution when dealing with high-dimensional datasets, and mutual information is a powerful tool for addressing it. A common approach is the Maximum-Relevance Minimum-Redundancy (MRMR) method. In this paper, based on insights from information theory, a new objective function is used. In addition, a novel mutual information estimator is employed, enabling us to discretize the data into finer levels. Combining the novel mutual information estimator with the new objective function, an error rate three times lower than that of MRMR is demonstrated.

References

[1] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[2] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.

[3] H. Yang and J. Moody. Data visualization and feature selection: New algorithms for nongaussian data. NIPS, 1999.

[4] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. arXiv preprint, 2014.

[5] Available online: http://www.nipsfsc.ecs.soton.ac.uk/datasets