Filter Methods, Part I: Basic Principles and Methods

Feature Selection: Wrappers

Input: large feature set Ω
10  Identify candidate subset S ⊆ Ω
20  While !stop_criterion():
        Evaluate error of a classifier using S.
        Adapt subset S.
30  Return S.

Pros: excellent performance for the chosen classifier.
Cons: computationally and memory-intensive.
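
A minimal sketch of this loop, assuming scikit-learn is available; greedy forward search, the 1-NN classifier and 5-fold cross-validation are illustrative choices, not prescribed by the slide.

    # Hedged sketch: greedy forward selection wrapped around a 1-NN classifier.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def wrapper_forward_selection(X, y, n_features):
        remaining = list(range(X.shape[1]))        # candidate features in Omega
        selected = []                              # current subset S
        clf = KNeighborsClassifier(n_neighbors=1)
        while len(selected) < n_features:          # stop criterion
            # Evaluate classifier accuracy for every one-feature extension of S.
            scores = [cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining]
            best = remaining[int(np.argmax(scores))]
            selected.append(best)                  # adapt subset S
            remaining.remove(best)
        return selected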

Feature Selection: Filters

Input: large feature set Ω
10  Identify candidate subset S ⊆ Ω
20  While !stop_criterion():
        Evaluate utility function J using S.
        Adapt subset S.
30  Return S.

Pros: fast; provides a generically useful feature set.
Cons: generally higher error than wrappers.

Types of Filters

A filter evaluates statistics of the data.
Univariate filters evaluate each feature independently.
Multivariate filters evaluate features in the context of others.

Also: some data is ordered (e.g. 1, 2, 3) and some is not (e.g. dog, cat, sheep, i.e. categorical).
A filter statistic must take this into account. Today we mostly look at numerical (ordered) data.

How useful is a single feature? : Univariate filters

Trying to predict someone's Biology exam grade from various possible indicators (a.k.a. features):
(1) Chemistry grade, (2) History grade, (3) Biology mock exam grade, or (4) Height...
Which one would you pick?

Pearson's Correlation Coefficient

Feature:  x_k = {x_k^{(1)}, ..., x_k^{(N)}}^T        Target:  y = {y^{(1)}, ..., y^{(N)}}^T

r(x, y) = \frac{ \sum_{i=1}^{N} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y}) }{ \sqrt{ \sum_{i=1}^{N} (x^{(i)} - \bar{x})^2 } \, \sqrt{ \sum_{i=1}^{N} (y^{(i)} - \bar{y})^2 } }

[Figure: example scatter plots with r = +0.5, r = 0.0, r = -0.5]

Both positive and negative correlation are useful!
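
As a concrete illustration of the formula, a small NumPy helper; the name pearson_r is ours, not from the slide.

    import numpy as np

    def pearson_r(x, y):
        # r(x, y) = sum_i (x_i - xbar)(y_i - ybar) / sqrt( sum_i (x_i - xbar)^2 * sum_i (y_i - ybar)^2 )
        xc, yc = x - x.mean(), y - y.mean()
        return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

    # Sanity check against NumPy's built-in estimate: np.corrcoef(x, y)[0, 1]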

Pearson's Correlation Coefficient

x_k = {x_k^{(1)}, ..., x_k^{(N)}}^T,  k = 1..M        y = {y^{(1)}, ..., y^{(N)}}^T

The estimated utility for feature X_k is:  J(X_k) = |r(x_k, y)|  (i.e. absolute correlation with the target).

Algorithm
10. Rank features in descending order by J.
20. Evaluate the predictor on M nested subsets.
30. Choose the subset with the lowest validation error.

Features are ranked by their score J.
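
A sketch of steps 10-30, reusing the pearson_r helper above; the 1-NN classifier and 5-fold validation are illustrative assumptions, not part of the slide.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def rank_and_select(X, y):
        M = X.shape[1]
        J = np.array([abs(pearson_r(X[:, k], y)) for k in range(M)])  # J(X_k) = |r(x_k, y)|
        order = np.argsort(-J)                                        # 10. rank in descending order of J
        clf = KNeighborsClassifier(n_neighbors=1)
        errors = [1.0 - cross_val_score(clf, X[:, order[:m]], y, cv=5).mean()
                  for m in range(1, M + 1)]                           # 20. evaluate M nested subsets
        best_m = int(np.argmin(errors)) + 1                           # 30. lowest validation error
        return order[:best_m]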

Ranking with Filter Criteria

Rank features X_k by their values of J(X_k). Retain the highest-ranked features, discard the lowest-ranked.

  k     J(X_k)
  35    0.846
  42    0.811
  10    0.810
  654   0.611
  22    0.443
  59    0.388
  ...   ...
  212   0.09
  39    0.05

Cut-off point decided by the user, e.g. |S| = 5, so S = {35, 42, 10, 654, 22}. Or by cross-validation.

Limitations...

Pearson assumes all features are INDEPENDENT!
and... only detects LINEAR correlations...

Pearson's Correlation Coefficient

With binary y, Pearson corresponds to linear separability.

[Figure: two scatter plots of Class Label vs Feature Value; left panel r = 0.15256, right panel r = 0.86652]

Pearson's Correlation Coefficient

And...

[Figure: two scatter plots of Class Label vs Feature Value; left panel r = 0.99357, right panel r = 0.10948]

Beware multi-class problems!... Why?

Fisher Score

Something a little more sensible for classification problems:

J(X_k) = \frac{ (\mu(y_+) - \mu(y_-))^2 }{ \sigma(y_+)^2 + \sigma(y_-)^2 }

Maximum between-class variance (difference of means).
Minimum within-class variance (sum of variances).
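
A minimal sketch of this score for one feature column, assuming a two-class problem encoded as y ∈ {0, 1}; the helper name fisher_score is not from the slides.

    import numpy as np

    def fisher_score(x, y):
        # J(X_k) = (mu(y+) - mu(y-))^2 / (sigma(y+)^2 + sigma(y-)^2)
        x_pos, x_neg = x[y == 1], x[y == 0]
        return (x_pos.mean() - x_neg.mean()) ** 2 / (x_pos.var() + x_neg.var())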

Mutual Information

What if we have categorical variables? X is relevant to Y if they are dependent, i.e. p(xy) ≠ p(x)p(y).
So let's measure the KL-divergence between these distributions:

J(X_k) = I(X_k; Y) = \sum_{x \in X_k} \sum_{y \in Y} p(xy) \log \frac{p(xy)}{p(x) p(y)}

Again, RANK features by their score J. We will see more of this in the next lecture.
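
For the categorical case, the sum above can be estimated by plugging in empirical frequencies; a hedged sketch (the helper name mutual_information is ours):

    import numpy as np

    def mutual_information(x, y):
        # Plug-in estimate of I(X;Y) = sum_{x,y} p(xy) log( p(xy) / (p(x)p(y)) ) from counts.
        mi = 0.0
        for xv in np.unique(x):
            for yv in np.unique(y):
                pxy = np.mean((x == xv) & (y == yv))
                if pxy > 0:
                    mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
        return mi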

There are LOTS of ranking criteria...

Many produce very similar rankings...

W. Duch, "Filter Methods", ch. 2 of Feature Extraction: Foundations and Applications.

There are LOTS of ranking criteria...

Pearson, Fisher, Mutual Info, Jeffreys-Matusita, Gini Index, AUC, F-measure, Kolmogorov distance, Chi-squared, CFS, Alpha-divergence, Symmetrical Uncertainty, ... etc, etc.

How do I pick!? Unfortunately, it is quite complex... it depends on:
- type of variables/targets (continuous, discrete, categorical)
- class distribution
- degree of nonlinearity/feature interaction
- amount of available data

And ultimately... the No Free Lunch theorem applies.

"There are no relevancy definitions independent of the learner or error measure that solve the feature selection problem."
Tsamardinos et al., "Towards Principled Feature Selection: Relevancy, Filters and Wrappers", AISTATS 2003.

Ranking criteria have been studied for a long time...

Some of the coolest stuff was done a long time ago! Still possible to learn from it!

J. Kittler, "Mathematical Methods of Feature Selection in Pattern Recognition", International Journal of Man-Machine Studies, vol. 7(5), 1975.
D. Boekee & J. Van Der Lubbe, "Some Aspects of Error Bounds in Feature Selection", Pattern Recognition, vol. 11, 1978.
W. McGill, "Multivariate information transmission", Psychometrika 19, 97-116, 1954.

Some ideas published in the 2000s were done first in the 1970s!

Significance of Pearson's Correlation Coefficient

[Figure: minimum correlation for 95% confidence (y-axis) against number of examples (x-axis, 0 to 500)]

Example reading of the above graph: a correlation of r = 0.2 with fewer than 100 examples is statistically insignificant. We need at least 100 examples to be confident that r = 0.2 is not due to chance.
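
The curve can be reproduced from the standard t-test for a correlation coefficient, t = r sqrt(N-2) / sqrt(1-r^2) with N-2 degrees of freedom under the null, so the smallest significant |r| is t_crit / sqrt(t_crit^2 + N - 2). A sketch assuming SciPy:

    import numpy as np
    from scipy import stats

    def min_significant_r(N, alpha=0.05):
        # Smallest |r| that rejects H0: rho = 0 at level alpha (two-sided) with N examples.
        t_crit = stats.t.ppf(1 - alpha / 2, df=N - 2)
        return t_crit / np.sqrt(t_crit ** 2 + N - 2)

    # min_significant_r(100) is roughly 0.197, consistent with the reading of the graph above.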

Pearson's Correlation Coefficient

x_k = {x_k^{(1)}, ..., x_k^{(N)}}^T,  k = 1..M        y = {y^{(1)}, ..., y^{(N)}}^T

Algorithm
10. Rank features in descending order by J.
15. Remove statistically insignificant features.
20. Evaluate the predictor on M nested subsets.
30. Choose the subset with the lowest validation error.

Search Space: Wrappers

Evaluates M(M+1)/2 feature subsets.

Search Space: Filter Ranking Methods

Ranking is provided by the criterion, hence no need to search.

Things to Remember

In general, features work in combination... It doesn't look like either the X or Y axis here is very useful. But if we have both together... perfect separation.

I. Guyon et al., "An Introduction to Variable and Feature Selection", JMLR 2004.

Things to Remember

Features can be individually completely irrelevant, and only useful when combined with others.

I. Guyon et al., "An Introduction to Variable and Feature Selection", JMLR 2004.

Things to Remember

We're not just dealing with 2 dimensions / features... This is known as the chessboard data, and corresponds to XOR.

X1  X2  |  Y
 0   0  |  0
 0   1  |  1
 1   0  |  1
 1   1  |  0

Things to Remember

We're not just dealing with 2 dimensions / features... but XOR is a special case of the odd-parity problem...

X1  X2  X3  |  Y
 0   0   0  |  0
 0   0   1  |  1
 0   1   0  |  1
 0   1   1  |  0
 1   0   0  |  1
 1   0   1  |  0
 1   1   0  |  0
 1   1   1  |  1

Pearson, Mutual Info, etc. all return J(X_k) = 0 for all features.
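
A quick check of that claim, enumerating the 3-bit odd-parity table and scoring each feature with the pearson_r and mutual_information helpers sketched earlier (both hypothetical names, not from the slides):

    import numpy as np
    from itertools import product

    X = np.array(list(product([0, 1], repeat=3)))   # all 3-bit patterns
    y = X.sum(axis=1) % 2                           # odd-parity target from the table above

    for k in range(3):
        # Each feature on its own is independent of y, so both scores come out as zero.
        print(k, pearson_r(X[:, k].astype(float), y.astype(float)),
                 mutual_information(X[:, k], y))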

Things to Remember

But how realistic is parity data!?

X1  X2  X3  |  Y
 0   0   0  |  0
 0   0   1  |  1
 0   1   0  |  1
 0   1   1  |  0
 1   0   0  |  1
 1   0   1  |  0
 1   1   0  |  0
 1   1   1  |  1

Very... current theories of gene regulatory networks depend on it...

Yaragatti M., Wen Q., "Analysis of Functional Genomic Signals Using the XOR Gate", PLoS ONE 4(5), 2009.

Key Point

The relevance of a feature can only be fairly assessed in the context of other features.
Independent ranking criteria are FAST, but naive, being univariate.
Not all filter methods are naive. Some use context. These are multivariate filters.

RELIEF (Kira & Rendell, 1992)

Classic filter method, very popular.

If D_hit ≥ D_miss... BAD feature!

RELIEF algorithm

10. Set all weights w(i) := 0
20. For t := 1 to T
30.     Randomly select an instance
40.     Find nearest hit H and nearest miss M
50.     For each feature i,
60.         w(i) ← w(i) + D_miss - D_hit
70.     End
80. End

D_{miss} = \frac{(x_i - x_i^{(M)})^2}{\max(x_i) - \min(x_i)}        D_{hit} = \frac{(x_i - x_i^{(H)})^2}{\max(x_i) - \min(x_i)}

Stochastic! Can be made deterministic by setting T = |D|.
RELIEF is computationally more expensive than Pearson.
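
A compact NumPy sketch of the algorithm above; the Euclidean nearest-neighbour search and the seed parameter are implementation choices not fixed by the slide.

    import numpy as np

    def relief(X, y, T=100, seed=0):
        rng = np.random.default_rng(seed)
        N, M = X.shape
        span = X.max(axis=0) - X.min(axis=0)   # max(x_i) - min(x_i); assumes no constant features
        w = np.zeros(M)                        # 10. set all weights to zero
        for _ in range(T):                     # 20. repeat T times
            r = rng.integers(N)                # 30. randomly select an instance
            d = ((X - X[r]) ** 2).sum(axis=1)
            d[r] = np.inf                      # exclude the instance itself
            hit = np.argmin(np.where(y == y[r], d, np.inf))    # 40. nearest hit H
            miss = np.argmin(np.where(y != y[r], d, np.inf))   #     nearest miss M
            w += ((X[r] - X[miss]) ** 2 - (X[r] - X[hit]) ** 2) / span   # 50-60. D_miss - D_hit per feature
        return w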

Pearson versus Relief

Breast Cancer data: 20 bootstraps, 1-NN classifier. Data rescaled to mean zero, variance one.

[Figure: OOB error vs number of features (0 to 30) for Pearson and Relief]

Pearson statistically insignificant after 26 features.
Notice Pearson beats Relief in early stages. Why?

Pearson versus Relief - The Effect of Feature Scaling

Scaling of features affects the outcome of RELIEF!

[Figure: OOB error vs number of features for Pearson and Relief, on scaled (left) versus unscaled (right) data]

NOTE: Pearson is not affected by scaling (correlation is scale-invariant)... but the subsequent K-NN is affected.
Relief IS affected by scaling, and so is the K-NN, hence the much larger variance.

The Pattern Recognition Pipeline

Coupling at all stages.

Data <-> FS: Relief does not cope well with unscaled data.
FS <-> Classifier: If the classifier cannot make use of the features, there is no hope.
Data <-> Classifier: Rescaling affects many classifiers.

Even coupling at the error stage - what about class imbalance?

Modified RELIEF

Use a ratio instead of a difference. Sum over ALL patterns.

50.     For each feature i,
60.         w(i) ← w(i) + \sum_{x \in D} \frac{|x_i - x_i^{(M)}|}{|x_i - x_i^{(H)}|}
70.     End

Avoids the scaling issue, but does behave differently from the original.
Also loses the strong theoretical links to margin maximisation.
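
A hedged sketch of how the ratio-based update could replace the one in the relief() sketch above; the eps guard against division by zero is our addition, not part of the slide.

    import numpy as np

    def modified_relief(X, y, eps=1e-12):
        N, M = X.shape
        w = np.zeros(M)
        for r in range(N):                                      # sum over ALL patterns
            d = ((X - X[r]) ** 2).sum(axis=1)
            d[r] = np.inf
            hit = np.argmin(np.where(y == y[r], d, np.inf))     # nearest hit
            miss = np.argmin(np.where(y != y[r], d, np.inf))    # nearest miss
            w += np.abs(X[r] - X[miss]) / (np.abs(X[r] - X[hit]) + eps)   # ratio, not difference
        return w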

Categorical features?

{Dog, Cat, Sheep} has no intrinsic ordering. So the nearest hit/miss are ill-defined.
We could use a 1-of-C representation, but that seems unsatisfactory...

Mutual Information to the rescue! NEXT LECTURE :-)