Filter Methods Part I: Basic Principles and Methods
Feature Selection: Wrappers

Input: large feature set Ω
10  Identify candidate subset S ⊆ Ω
20  While !stop_criterion():
      Evaluate the error of a classifier using S.
      Adapt subset S.
30  Return S.

Pros: excellent performance for the chosen classifier.
Cons: computationally and memory intensive.
Feature Selection: Filters

Input: large feature set Ω
10  Identify candidate subset S ⊆ Ω
20  While !stop_criterion():
      Evaluate utility function J using S.
      Adapt subset S.
30  Return S.

Pros: fast, provides a generically useful feature set.
Cons: generally higher error than wrappers.
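To make the contrast concrete, here is a minimal, self-contained sketch of the two loops above (numpy only; the function names, the greedy forward-selection strategy, the fixed subset size used as the stop criterion, and the toy data are all illustrative choices, not prescribed by the slides). The only structural difference is the evaluation step: the wrapper trains and tests a classifier (leave-one-out 1-NN error), while the filter computes a cheap utility statistic (mean absolute correlation with the target).

```python
import numpy as np

def loo_1nn_error(X, y):
    """Wrapper-style evaluation: leave-one-out error of a 1-NN classifier."""
    n = len(y)
    errors = 0
    for i in range(n):
        d = np.sum((X - X[i]) ** 2, axis=1)
        d[i] = np.inf                                  # exclude the point itself
        errors += int(y[np.argmin(d)] != y[i])
    return errors / n

def mean_abs_corr(X, y):
    """Filter-style evaluation: mean absolute Pearson correlation with the target."""
    return np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

def forward_select(X, y, k, evaluate, maximise):
    """Greedy forward selection: 'adapt subset S' = add the best single feature."""
    S = []
    while len(S) < k:                                  # stop criterion: |S| = k
        candidates = [j for j in range(X.shape[1]) if j not in S]
        scores = [evaluate(X[:, S + [j]], y) for j in candidates]
        pick = np.argmax(scores) if maximise else np.argmin(scores)
        S.append(candidates[int(pick)])
    return S

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=100) > 0).astype(int)

print("wrapper:", forward_select(X, y, 3, loo_1nn_error, maximise=False))
print("filter :", forward_select(X, y, 3, mean_abs_corr, maximise=True))
```

On this toy problem both variants should recover the informative features (0 and 3); the difference is that every wrapper evaluation costs a full leave-one-out pass, while every filter evaluation is a handful of correlations.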
Types of Filters

A filter evaluates statistics of the data.
Univariate filters evaluate each feature independently.
Multivariate filters evaluate features in the context of others.

Also... some data is ordered, e.g. 1, 2, 3.
Some is not, e.g. dog, cat, sheep (i.e. categorical).
A filter statistic must take this into account.
Today we mostly look at numerical (ordered) data.
How useful is a single feature? Univariate filters.

Trying to predict someone's Biology exam grade from various possible indicators (a.k.a. features):
(1) Chemistry grade, (2) History grade, (3) Biology mock exam grade, or (4) Height...
Which one would you pick?
Pearson's Correlation Coefficient

Feature: x_k = \{x_k^{(1)}, \ldots, x_k^{(N)}\}^T        Target: y = \{y^{(1)}, \ldots, y^{(N)}\}^T

r(x, y) = \frac{\sum_{i=1}^{N} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x^{(i)} - \bar{x})^2 \sum_{i=1}^{N} (y^{(i)} - \bar{y})^2}}

(Scatter plots on the slide illustrate r = +0.5, r = 0.0 and r = -0.5.)

Both positive and negative correlation is useful!
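As a sanity check on the formula, here is a direct transcription in code (numpy assumed), compared against numpy's built-in estimate:

```python
import numpy as np

def pearson_r(x, y):
    # numerator: covariance term; denominator: product of the two spreads
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)                 # positively correlated target
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])    # the two values agree
```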
Pearson's Correlation Coefficient

x_k = \{x_k^{(1)}, \ldots, x_k^{(N)}\}^T,  k = 1..M        y = \{y^{(1)}, \ldots, y^{(N)}\}^T

The estimated utility for feature X_k is:
J(X_k) = |r(x_k, y)|    (i.e. absolute correlation with the target)

Algorithm
10. Rank features in descending order by J.
20. Evaluate predictor on M nested subsets.
30. Choose subset with lowest validation error.

Features are ranked by their score J.
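A sketch of the three steps above (numpy only; the hold-out 1-NN predictor and the toy data are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def abs_corr(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return abs(np.sum(xc * yc)) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

def holdout_1nn_error(Xtr, ytr, Xva, yva):
    d = ((Xva[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    return np.mean(ytr[d.argmin(axis=1)] != yva)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = (X[:, 0] - X[:, 2] + 0.2 * rng.normal(size=200) > 0).astype(int)
Xtr, ytr, Xva, yva = X[:120], y[:120], X[120:], y[120:]

# 10. Rank features in descending order of J (computed on training data only).
ranking = np.argsort([-abs_corr(Xtr[:, k], ytr) for k in range(X.shape[1])])
# 20. Evaluate the predictor on the M nested subsets {top-1}, {top-2}, ..., {top-M}.
errors = [holdout_1nn_error(Xtr[:, ranking[:m]], ytr, Xva[:, ranking[:m]], yva)
          for m in range(1, X.shape[1] + 1)]
# 30. Choose the subset with the lowest validation error.
best_m = int(np.argmin(errors)) + 1
print("selected features:", ranking[:best_m])
```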
Ranking with Filter Criteria

Rank features X_k by their values of J(X_k).
Retain the highest ranked features, discard the lowest ranked.

   k    J(X_k)
   35   0.846
   42   0.811
   10   0.810
  654   0.611
   22   0.443
   59   0.388
  ...   ...
  212   0.09
   39   0.05

Cut-off point decided by the user, e.g. |S| = 5, so S = {35, 42, 10, 654, 22}. Or by cross-validation.
Limitations...

Ranking by Pearson treats all features as INDEPENDENT!
and... it only detects LINEAR correlations...
Pearson's Correlation Coefficient

With binary y, Pearson corresponds to linear separability.

(Two scatter plots of class label against feature value: one feature with r = 0.15, one with r = 0.87.)
Pearson's Correlation Coefficient

And...

(Two further scatter plots of class label against feature value, with r = 0.99 and r = 0.11.)

Beware multi-class problems!... Why?
Fisher Score

Something a little more sensible for classification problems:

J(X_k) = \frac{(\mu(y_+) - \mu(y_-))^2}{\sigma(y_+)^2 + \sigma(y_-)^2}

Maximum between-class variance (difference of means).
Minimum within-class variance (sum of variances).
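A minimal sketch of the score for a two-class problem (numpy assumed), with class means and variances estimated empirically; the toy data below are illustrative:

```python
import numpy as np

def fisher_score(x, y):
    xp, xn = x[y == 1], x[y == 0]                      # positive / negative class samples
    return (xp.mean() - xn.mean()) ** 2 / (xp.var() + xn.var())

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=300)
informative = rng.normal(loc=y * 1.5, scale=1.0)       # class means differ
noise = rng.normal(size=300)                           # carries no class information
print(fisher_score(informative, y), fisher_score(noise, y))  # first score is much larger
```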
Mutual Information

What if we have categorical variables?

X is relevant to Y if they are dependent, i.e. p(xy) ≠ p(x)p(y).
So let's measure the KL-divergence between these distributions:

J(X_k) = I(X_k; Y) = \sum_{x \in X_k} \sum_{y \in Y} p(xy) \log \frac{p(xy)}{p(x)\,p(y)}

Again, RANK features by their score J.
We will see more of this in the next lecture.
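A sketch of the estimator for categorical variables, plugging empirical joint and marginal frequencies into the sum above (numpy only; the animal example simply echoes the categorical data mentioned earlier):

```python
import numpy as np

def mutual_information(x, y):
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))       # empirical joint p(x, y)
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:                                # 0 * log 0 treated as 0
                mi += pxy * np.log(pxy / (px * py))
    return mi

animals = ["dog", "cat", "dog", "sheep", "cat", "dog", "sheep", "cat"]
labels  = [0,     1,     0,     1,       1,     0,     1,       1]
print(mutual_information(animals, labels))   # > 0: the two variables are dependent
```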
There are LOTS of ranking criteria...

Many produce very similar rankings...

W. Duch, "Filter Methods", ch. 2 of Feature Extraction: Foundations and Applications.
There are LOTS of ranking criteria...

Pearson, Fisher, Mutual Info, Jeffreys-Matusita, Gini Index, AUC, F-measure, Kolmogorov distance, Chi-squared, CFS, Alpha-divergence, Symmetrical Uncertainty, ... etc, etc.

How do I pick!? Unfortunately, it's quite complex... it depends on:
- the type of variables/targets (continuous, discrete, categorical)
- the class distribution
- the degree of nonlinearity / feature interaction
- the amount of available data

And ultimately... the No Free Lunch theorem applies.

"There are no relevancy definitions independent of the learner or error measure that solve the feature selection problem."
Tsamardinos et al, "Towards Principled Feature Selection: Relevancy, Filters and Wrappers", AISTATS 2003.
Ranking criteria have been studied for a long time...

Some of the coolest stuff was done a long time ago! Still possible to learn from it!

J. Kittler, "Mathematical Methods of Feature Selection in Pattern Recognition", International Journal of Man-Machine Studies, vol 7(5), 1975.
D. Boekee & J. Van Der Lubbe, "Some Aspects of Error Bounds in Feature Selection", Pattern Recognition, vol 11, 1978.
W. McGill, "Multivariate information transmission", Psychometrika 19, 97-116, 1954.

Some ideas published in the 2000s were done first in the 1970s!
Significance of Pearson's Correlation Coefficient

(Plot: minimum correlation for 95% confidence against the number of examples, from 0 to 500.)

Example reading of the above graph: a correlation of r = 0.2 with fewer than 100 examples is statistically insignificant. We need at least 100 examples to know whether r = 0.2 is not due to chance.
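The usual test behind such a curve is the t-test for a correlation coefficient: under the null hypothesis of zero correlation, t = r \sqrt{(N-2)/(1-r^2)} follows a Student-t distribution with N-2 degrees of freedom. A sketch, assuming scipy is available:

```python
import numpy as np
from scipy import stats

def correlation_p_value(r, n):
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)    # two-sided p-value

for n in (50, 100, 200):
    print(n, correlation_p_value(0.2, n))
# r = 0.2 only drops below p = 0.05 at roughly n = 100,
# consistent with the example reading above.
```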
Pearson's Correlation Coefficient

x_k = \{x_k^{(1)}, \ldots, x_k^{(N)}\}^T,  k = 1..M        y = \{y^{(1)}, \ldots, y^{(N)}\}^T

Algorithm
10. Rank features in descending order by J.
15. Remove statistically insignificant features.
20. Evaluate predictor on M nested subsets.
30. Choose subset with lowest validation error.
Search Space: Wrappers

Evaluates M(M+1)/2 feature subsets: a sequential (greedy forward or backward) search over M features tries M + (M-1) + ... + 1 candidate subsets, which is already 465 subsets for M = 30.
Search Space: Filter Ranking Methods

The ranking is provided by the criterion, hence no need to search.
Things to Remember

In general, features work in combination...
It doesn't look like either the X or Y axis here is very useful. But if we have both together... perfect separation...

I. Guyon et al, "An Introduction to Variable and Feature Selection", JMLR 2004.
Things to Remember

Features can be individually completely irrelevant, and only useful when combined with others.

I. Guyon et al, "An Introduction to Variable and Feature Selection", JMLR 2004.
Things to Remember

We're not just dealing with 2 dimensions / features...

This is known as the chessboard data, and corresponds to XOR.

  X1  X2  |  Y
   0   0  |  0
   0   1  |  1
   1   0  |  1
   1   1  |  0
Things to Remember

We're not just dealing with 2 dimensions / features... but XOR is a special case of the odd-parity problem...

  X1  X2  X3  |  Y
   0   0   0  |  0
   0   0   1  |  1
   0   1   0  |  1
   0   1   1  |  0
   1   0   0  |  1
   1   0   1  |  0
   1   1   0  |  0
   1   1   1  |  1

Pearson, Mutual Info, etc, all return J(X_k) = 0 for all features.
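A quick check of that claim (numpy only; mutual information estimated from empirical frequencies over the eight parity rows):

```python
import numpy as np

X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
y = X.sum(axis=1) % 2                                  # odd-parity target

def mutual_information(x, y):
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

for k in range(3):
    r = np.corrcoef(X[:, k], y)[0, 1]
    print(f"feature {k}: |r| = {abs(r):.3f}, I(X;Y) = {mutual_information(X[:, k], y):.3f}")
# Every feature scores 0 on its own, yet the three together determine y exactly.
```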
Things to Remember

But how realistic is parity data!?

(Same odd-parity truth table as above.)

Very... current theories of gene regulatory networks depend on it...

"Analysis of Functional Genomic Signals Using the XOR Gate", Yaragatti M, Wen Q, PLoS ONE 4(5), 2009.
Key Point

The relevance of a feature can only be fairly assessed in the context of other features.

Independent ranking criteria are FAST, but naive, being univariate.
Not all filter methods are naive. Some use context. These are multivariate filters.
RELIEF (Kira & Rendell, 1992)

Classic filter method, very popular.

If D_hit ≥ D_miss... BAD feature!
RELIEF algorithm

10. Set all weights w(i) := 0
20. For t := 1 to T
30.   Randomly select an instance
40.   Find its nearest hit H and nearest miss M
50.   For each feature i,
60.     w(i) ← w(i) + D_miss - D_hit
70.   End
80. End

D_hit = \frac{(x_i - x_i^{(H)})^2}{\max(x_i) - \min(x_i)}        D_miss = \frac{(x_i - x_i^{(M)})^2}{\max(x_i) - \min(x_i)}

Stochastic! Can be made deterministic by setting T = |D|.
RELIEF is computationally more expensive than Pearson.
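A minimal sketch of the algorithm above (numpy only), in the deterministic T = |D| variant, using the squared, range-normalised differences of the D_hit / D_miss definitions; the toy data are illustrative:

```python
import numpy as np

def relief(X, y):
    n, m = X.shape
    span = X.max(axis=0) - X.min(axis=0)               # max(x_i) - min(x_i), per feature
    w = np.zeros(m)
    for t in range(n):                                 # deterministic: one pass over D
        d = np.sum((X - X[t]) ** 2, axis=1)
        d[t] = np.inf                                  # never match the instance itself
        same_idx = np.where(y == y[t])[0]
        diff_idx = np.where(y != y[t])[0]
        hit = same_idx[np.argmin(d[same_idx])]         # nearest hit H
        miss = diff_idx[np.argmin(d[diff_idx])]        # nearest miss M
        w += (X[t] - X[miss]) ** 2 / span - (X[t] - X[hit]) ** 2 / span
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(relief(X, y))   # the informative features (0 and 1) should get the largest weights
```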
Pearson versus Relief

Breast Cancer data: 20 bootstraps, 1-NN classifier. Data rescaled to mean zero, variance one.

(Plot: out-of-bag (OOB) error against number of features, for the Pearson and Relief rankings.)

Pearson statistically insignificant after 26 features.
Notice Pearson beats Relief in early stages. Why?
Pearson versus Relief - The Effect of Feature Scaling

Scaling of features affects the outcome of RELIEF!

(Two plots of OOB error against number of features: scaled data (left) versus unscaled data (right).)

NOTE: Pearson is not affected by scaling (correlation is scale-invariant)... but the subsequent k-NN is affected.
Relief IS affected by scaling, and so is the k-NN, hence the much larger variance.
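A small illustration of the asymmetry (numpy only, toy data): rescaling a feature leaves Pearson correlation untouched, but it changes the distances that RELIEF and k-NN rely on.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
y = (X[:, 0] > 0).astype(float)

print(np.corrcoef(X[:, 0], y)[0, 1])           # correlation of feature 0 with y
print(np.corrcoef(X[:, 0] * 100, y)[0, 1])     # identical after rescaling the feature

def nearest(X, i):
    d = np.sum((X - X[i]) ** 2, axis=1)
    d[i] = np.inf
    return int(np.argmin(d))

X_scaled = X.copy()
X_scaled[:, 1] *= 100                          # blow up feature 1
# the nearest neighbour can change once one feature dominates the distances
print(nearest(X, 0), nearest(X_scaled, 0))
```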
The Pattern Recognition Pipeline

Coupling at all stages.

Data ↔ FS : Relief does not cope well with unscaled data.
FS ↔ Classifier : If the classifier cannot make use of the features, there is no hope.
Data ↔ Classifier : Rescaling affects many classifiers.

Even coupling at the error stage - what about class imbalance?
Modified RELIEF

Use a ratio instead of a difference. Sum over ALL patterns.

50. For each feature i,
60.   w(i) ← w(i) + \sum_{x \in D} \frac{|x_i - x_i^{(M)}|}{|x_i - x_i^{(H)}|}
70. End

Avoids the scaling issue, but does behave differently than the original.
Also loses the strong theoretical links to margin maximisation.
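A sketch of the modified update (numpy only). The exact form of the ratio is an assumption reconstructed from the slide: per-feature miss-distance over hit-distance, summed over every pattern, with no range normalisation.

```python
import numpy as np

def relief_ratio(X, y):
    # Assumed ratio-based update: |x_i - x_i^(M)| / |x_i - x_i^(H)|, summed over all patterns.
    n, m = X.shape
    w = np.zeros(m)
    for t in range(n):                                 # sum over ALL patterns
        d = np.sum((X - X[t]) ** 2, axis=1)
        d[t] = np.inf
        same_idx = np.where(y == y[t])[0]
        diff_idx = np.where(y != y[t])[0]
        hit = same_idx[np.argmin(d[same_idx])]
        miss = diff_idx[np.argmin(d[diff_idx])]
        # small epsilon guards against exact ties in the denominator
        w += np.abs(X[t] - X[miss]) / (np.abs(X[t] - X[hit]) + 1e-12)
    return w

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(relief_ratio(X, y))   # near-ties in the denominator can dominate a weight,
                            # one way this "behaves differently than the original"
```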
Categorical features?

{Dog, Cat, Sheep} has no intrinsic ordering. So, nearest hit/miss are ill-defined.
Could use a 1-of-C representation, but that seems unsatisfactory...

Mutual Information to the rescue! NEXT LECTURE :-)