Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection



Gavin Brown, Adam Pocock, Mingjie Zhao and Mikel Luján
School of Computer Science, University of Manchester
Presented by Wenzhao Lian
July 27, 2012

Outline
1. Main Contribution
2. Background
3. Main work
4. Experiments
5. Conclusion

Main Contribution
The feature selection problem: select the feature set that is most relevant and least redundant.
What is the criterion of selection? Existing criteria define scoring functions that measure the relevancy and redundancy of features.
In this paper:
Derive a scoring function instead of defining one.
Propose a unifying framework for information theoretic feature selection.
Show that the general criterion naturally reduces to existing criteria under different assumptions.


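Background
Entropy and Mutual Information

$$
\begin{aligned}
H(X) &= -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\
H(X \mid Y) &= -\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x \mid y) \log p(x \mid y) \\
I(X; Y) &= H(X) - H(X \mid Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(xy) \log \frac{p(xy)}{p(x)\,p(y)} \\
I(X; Y \mid Z) &= H(X \mid Z) - H(X \mid YZ) = \sum_{z \in \mathcal{Z}} p(z) \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(xy \mid z) \log \frac{p(xy \mid z)}{p(x \mid z)\,p(y \mid z)}
\end{aligned}
\tag{1}
$$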
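The quantities in (1) can be estimated from discrete samples by replacing each probability with an empirical frequency. The following is a minimal sketch of that plug-in approach, not the estimator used in the paper; the function names are hypothetical, and $H(X \mid Y)$ is computed through the equivalent identity $H(X \mid Y) = H(X, Y) - H(Y)$.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in estimate of H(X) in bits from a list of discrete samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def conditional_entropy(xs, ys):
    """H(X|Y), computed via the chain rule H(X|Y) = H(X,Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)


def mutual_information(xs, ys):
    """I(X;Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) = H(X|Z) - H(X|YZ)."""
    yz = list(zip(ys, zs))
    return conditional_entropy(xs, zs) - conditional_entropy(xs, yz)


# Toy data (an assumption for illustration): x and y are independent
# fair bits, and z = x XOR y.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [a ^ b for a, b in zip(x, y)]

print(mutual_information(x, x))                 # 1 bit: X determines itself
print(mutual_information(x, y))                 # 0 bits: independent
print(conditional_mutual_information(x, y, z))  # 1 bit: dependent once Z is known
```

The XOR case shows why conditioning matters: two variables that share no information marginally can become fully dependent once a third variable is observed.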

Previous Feature Selection Criteria

Mutual Information Maximization (MIM)
$$J_{mim}(X_k) = I(X_k; Y) \tag{2}$$
$J_{mim}$: relevance index. $X_k$: the $k$-th feature. $Y$: class label.

Mutual Information Feature Selection (MIFS)
$$J_{mifs}(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) \tag{3}$$
$J_{mifs}$: relevance index. $S$: set of currently selected features. $\beta$ controls the redundancy penalty.

Joint Mutual Information (JMI)
$$J_{jmi}(X_k) = \sum_{X_j \in S} I(X_k X_j; Y) \tag{4}$$
This indicates that a candidate feature which is complementary to the existing features should be included.


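Conditional Likelihood Problem
Data: $D = \{(\mathbf{x}^i, y^i);\ i = 1..N\}$, with $\mathbf{x}^i = [x^i_1, x^i_2, \ldots, x^i_d]^T$.
Partition of the features: $\mathbf{x} = \{\mathbf{x}_\theta, \mathbf{x}_{\tilde\theta}\}$, the selected and unselected features.
$\tau$: parameters used to predict $y$.
The conditional log likelihood of the labels given parameters $\theta, \tau$ is
$$\ell = \frac{1}{N} \sum_{i=1}^{N} \log q(y^i \mid \mathbf{x}^i_\theta, \tau) \tag{5}$$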

Conditional Likelihood Problem
Introduce $p(y \mid \mathbf{x}_\theta)$ and $p(y \mid \mathbf{x})$: the true distribution of the class labels given the selected features $\mathbf{x}_\theta$, and of the class labels given all features.
$$\ell = \frac{1}{N} \sum_{i=1}^{N} \log \frac{q(y^i \mid \mathbf{x}^i_\theta, \tau)}{p(y^i \mid \mathbf{x}^i_\theta)} + \frac{1}{N} \sum_{i=1}^{N} \log \frac{p(y^i \mid \mathbf{x}^i_\theta)}{p(y^i \mid \mathbf{x}^i)} + \frac{1}{N} \sum_{i=1}^{N} \log p(y^i \mid \mathbf{x}^i) \tag{6}$$
In the limit of large $N$, maximizing $\ell$ becomes minimizing
$$-\ell \approx E_{xy}\left\{ \log \frac{p(y \mid \mathbf{x}_\theta)}{q(y \mid \mathbf{x}_\theta, \tau)} \right\} + I(X_{\tilde\theta}; Y \mid X_\theta) + H(Y \mid X) \tag{7}$$
The first term depends on the model.
The final term gives a lower bound on the Bayes error.
Under the filter assumption, which treats optimizing the feature set and optimizing the classifier as two independent stages, we can minimize the second term without regard to the first.
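The step from (6) to (7) can be spelled out for the second and third terms. The following is a sketch that assumes i.i.d. samples, so that sample averages converge to expectations, and writes $x_{\tilde\theta}$ for the unselected features. It uses the identity $p(y \mid x) = p(x_{\tilde\theta}, y \mid x_\theta) / p(x_{\tilde\theta} \mid x_\theta)$ and the definitions in (1).

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^{N} \log \frac{p(y^i \mid x^i_\theta)}{p(y^i \mid x^i)}
  &\;\xrightarrow{N \to \infty}\;
  E_{xy}\left\{ \log \frac{p(y \mid x_\theta)}{p(y \mid x)} \right\} \\
  &= -E_{xy}\left\{ \log
     \frac{p(x_{\tilde\theta}, y \mid x_\theta)}
          {p(x_{\tilde\theta} \mid x_\theta)\, p(y \mid x_\theta)} \right\}
   = -I(X_{\tilde\theta}; Y \mid X_\theta), \\
\frac{1}{N}\sum_{i=1}^{N} \log p(y^i \mid x^i)
  &\;\xrightarrow{N \to \infty}\;
  E_{xy}\left\{ \log p(y \mid x) \right\} = -H(Y \mid X).
\end{align*}
```

Negating both limits gives the second and third terms of the expression for $-\ell$.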

Conditional Likelihood Problem
For the second term, we have
$$I(X_{\tilde\theta}; Y \mid X_\theta) = I(X; Y) - I(X_\theta; Y) \tag{8}$$
Since $I(X; Y)$ does not depend on $\theta$, minimizing $I(X_{\tilde\theta}; Y \mid X_\theta)$ is equivalent to maximizing $I(X_\theta; Y)$.
Using the greedy approach:
1. Initialize the selected set as the empty set.
2. At each step, select the feature that has the highest score.
3. Repeat step 2 until a stopping criterion is reached.
$S$ is the currently selected set, and the score for a feature $X_k$ is
$$J_{cmi}(X_k) = I(X_k; Y \mid S) \tag{9}$$
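The greedy procedure with the score $I(X_k; Y \mid S)$ can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `greedy_cmi` and `num_to_select` are hypothetical names, the stopping criterion is assumed to be a fixed number of features, and the conditional mutual information is a plug-in estimate using $I(X; Y \mid Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)$.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Plug-in entropy estimate (bits) from a list of hashable samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def cond_mutual_info(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(list(zs)))


def greedy_cmi(columns, labels, num_to_select):
    """Forward selection: repeatedly add the feature maximizing I(X_k; Y | S)."""
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < num_to_select:
        # Joint state of the selected features; a constant when S is empty.
        joint = list(zip(*(columns[j] for j in selected))) or [()] * len(labels)
        best = max(remaining,
                   key=lambda k: cond_mutual_info(columns[k], labels, joint))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (an assumption for illustration): three fair bits a, b, c,
# with label y encoding all three.
rows = list(product([0, 1], repeat=3))
labels = [4 * a + 2 * b + c for a, b, c in rows]
columns = [
    [2 * a + b for a, b, c in rows],  # feature 0: encodes (a, b)
    [a + b for a, b, c in rows],      # feature 1: a function of feature 0
    [c for a, b, c in rows],          # feature 2: complementary to feature 0
]

print(greedy_cmi(columns, labels, 2))  # [0, 2]
```

Feature 1 has higher marginal mutual information with the label than feature 2 (1.5 bits against 1 bit), but it adds nothing once feature 0 is selected, so the conditional score skips it. Conditioning on the full joint state of $S$ needs many samples as $S$ grows, which is one motivation for low-dimensional approximations of this score.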

Unifying criterion
To bring the score functions proposed in previous work into this framework, three assumptions are needed.

Assumption 1
For all unselected features $X_k \in X_{\tilde\theta}$, assume
$$p(x_\theta \mid x_k) = \prod_{j \in S} p(x_j \mid x_k), \qquad p(x_\theta \mid x_k y) = \prod_{j \in S} p(x_j \mid x_k y) \tag{10}$$
Under Assumption 1, an equivalent criterion can be written as
$$J'_{cmi}(X_k) = I(X_k; Y) - \sum_{j \in S} I(X_j; X_k) + \sum_{j \in S} I(X_j; X_k \mid Y) \tag{11}$$

Unifying criterion

Assumption 2
For all features, assume
$$p(x_i x_j \mid y) = p(x_i \mid y)\, p(x_j \mid y) \tag{12}$$

Assumption 3
For all features, assume
$$p(x_i x_j) = p(x_i)\, p(x_j) \tag{13}$$

Depending on how strong the belief in Assumptions 2 and 3 is, different criteria are obtained.
$$
\begin{aligned}
J_{mim}(X_k) &= I(X_k; Y) \\
J_{mifs}(X_k) &= I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) \\
J_{mrmr}(X_k) &= I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j) \\
J_{jmi}(X_k) &= I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} \left[ I(X_k; X_j) - I(X_k; X_j \mid Y) \right]
\end{aligned}
\tag{14}
$$
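The JMI entry in (14) looks different from its original definition as a sum of joint mutual informations, $\sum_{X_j \in S} I(X_k X_j; Y)$. A short derivation, sketched here using only the chain rule for mutual information, shows that the two forms rank candidate features identically.

```latex
\begin{align*}
I(X_k X_j; Y) &= I(X_j; Y) + I(X_k; Y \mid X_j), \\
I(X_k; Y \mid X_j) &= I(X_k; Y) - I(X_k; X_j) + I(X_k; X_j \mid Y), \\
\sum_{X_j \in S} I(X_k X_j; Y)
  &= \underbrace{\sum_{X_j \in S} I(X_j; Y)}_{\text{constant in } X_k}
   + |S|\, I(X_k; Y)
   - \sum_{X_j \in S} \left[ I(X_k; X_j) - I(X_k; X_j \mid Y) \right].
\end{align*}
```

Dropping the term that does not depend on $X_k$ and dividing by $|S|$ leaves the JMI form in (14). The second line follows from expanding $I(X_k; Y X_j)$ in the two possible orders.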

Unifying criterion
A general form of the unifying criterion:
$$J'_{cmi}(X_k) = I(X_k; Y) - \beta \sum_{j \in S} I(X_j; X_k) + \gamma \sum_{j \in S} I(X_j; X_k \mid Y) \tag{15}$$
[Figure: the full space of linear criteria (Figure 2 of Brown, Pocock, Zhao and Luján), describing several examples from Table 1 of that paper. All criteria in this space adopt Assumption 1. The γ and β axes represent the criteria's belief in Assumptions 2 and 3, respectively.]
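The linear form in (15) can be sketched as a single scoring function whose weights β and γ recover the individual criteria: β = γ = 0 for MIM, γ = 0 for MIFS, β = 1/|S| with γ = 0 for mRMR, and β = γ = 1/|S| for JMI. The code below is a minimal sketch under the assumption of discrete features and plug-in estimates; `linear_score` is a hypothetical name.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in entropy estimate (bits) from a list of hashable samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_info(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(list(xs)) + entropy(list(ys)) - entropy(list(zip(xs, ys)))


def cond_mutual_info(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(list(zs)))


def linear_score(candidate, selected, labels, beta, gamma):
    """Relevance minus beta * redundancy plus gamma * conditional redundancy."""
    relevance = mutual_info(candidate, labels)
    redundancy = sum(mutual_info(xj, candidate) for xj in selected)
    cond_redundancy = sum(cond_mutual_info(xj, candidate, labels)
                          for xj in selected)
    return relevance - beta * redundancy + gamma * cond_redundancy


# Toy data (an assumption for illustration): y = a XOR b, with feature a
# already selected and feature b as the candidate.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [0, 1, 1, 0]
selected = [a]
weight = 1.0 / len(selected)

mim = linear_score(b, selected, y, beta=0.0, gamma=0.0)
mrmr = linear_score(b, selected, y, beta=weight, gamma=0.0)
jmi = linear_score(b, selected, y, beta=weight, gamma=weight)
print(mim, mrmr, jmi)
```

On this XOR data the candidate carries no information about the label on its own, so the criteria with γ = 0 score it at zero, while the conditional redundancy term gives it a score of one bit under the JMI weights.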


Criteria
Evaluation criteria:
Stability or consistency
Similarity between different methods
Performance in limited and extreme small-sample situations
Stability and accuracy tradeoff
Classifier: a nearest neighbour classifier (k=3) is used.

Stability
[Figure: Stability Comparison (Figure 3 of Brown, Pocock, Zhao and Luján). Kuncheva's Stability Index across 15 data sets. The box indicates the upper/lower quartiles, the horizontal line within each shows the median value, while the dotted crossbars indicate the maximum/minimum values. For convenience of interpretation, criteria on the x-axis are ordered by their median value.]
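The slides do not define the stability measure. As a sketch, assuming the commonly used definition of Kuncheva's consistency index for two feature subsets of equal size $k$ drawn from $n$ features with $r$ features in common, $(rn - k^2) / (k(n - k))$, the stability of a criterion is the average index over all pairs of subsets it selects on resampled data. The function names are hypothetical.

```python
from itertools import combinations


def kuncheva_consistency(a, b, num_features):
    """Consistency of two equal-size subsets, corrected for chance overlap."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < num_features:
        raise ValueError("subsets must share a size k with 0 < k < num_features")
    r = len(a & b)  # number of features the two subsets have in common
    return (r * num_features - k * k) / (k * (num_features - k))


def kuncheva_stability(subsets, num_features):
    """Average pairwise consistency over a collection of selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(kuncheva_consistency(a, b, num_features)
               for a, b in pairs) / len(pairs)


# Hypothetical subsets selected on three resamples of a 10-feature data set.
subsets = [{0, 1, 2}, {0, 1, 2}, {0, 1, 3}]
print(kuncheva_consistency(subsets[0], subsets[1], 10))  # 1.0: identical subsets
print(kuncheva_stability(subsets, 10))
```

The index equals 1 for identical subsets and is close to 0 when the overlap is what random selection would give, which makes values comparable across subset sizes.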

Similarity
[Figure: Figure 5 of Brown, Pocock, Zhao and Luján, relations between feature sets generated by different criteria (caption truncated in the source). 2-D visualisation generated by classical multi-dimensional scaling. Panels: (a) Kuncheva's Consistency Index; (b) Yu et al.'s ... (truncated).]

Limited and Extreme Small-sample
[Figure: Limited and Extreme Small-sample]

Stability Accuracy Tradeoff
[Figure: Stability Accuracy Tradeoff]


Conclusion
Presented a unifying framework for information theoretic feature selection via optimization of the conditional likelihood.
Clarified the implicit assumptions made when using different feature selection criteria.
Conducted an empirical study of 9 heuristic mutual information criteria across data sets to analyze their properties.
