Lecture 11: Information Theoretic Methods. Mutual Information as Information Gain. Feature Transforms



Lecture 11: Information Theoretic Methods
Isabelle Guyon
Mutual Information as Information Gain and Feature Transforms (Book Chapter 6)

Information, Entropy
For an event happening with probability $p$, the quantity of information is $-\log p$. If the log is of base 2, the unit is the bit.
The average quantity of information over all events is the entropy:
$H = -\sum p \log p$

[Figure: the data matrix $X = \{x_{ij}\}$ ($m \times n$) is mapped to the feature matrix $\Phi = \{\phi_{ik}\}$ ($m \times n'$, $n' < n$), with $\phi_{ik} = \phi_k(x_i)$.]
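The two definitions above can be checked on a coin toss. This is a minimal sketch using only the Python standard library; the function names `information` and `entropy` and the two example distributions are illustrative choices, not part of the lecture.

```python
from math import log2

def information(p):
    """Quantity of information, in bits, of an event with probability p."""
    return -log2(p)

def entropy(probabilities):
    """Average information over all events; zero-probability events add nothing."""
    return sum(p * information(p) for p in probabilities if p > 0)

fair_coin = [0.5, 0.5]      # illustrative distribution
biased_coin = [0.9, 0.1]    # illustrative distribution

print(information(0.5))      # a one-in-two event carries 1 bit
print(entropy(fair_coin))    # 1 bit on average
print(entropy(biased_coin))  # below 1 bit: the outcome is more predictable
```

The biased coin has lower entropy than the fair one because its outcome is easier to guess.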

Conditional Entropy
Our case: $Y$ = class labels, $\Phi$ = feature representation.
Entropy: $H(Y) = -\sum_y p(y) \log p(y)$
Conditional entropy: $H(Y|\Phi) = -\int_f p(f) \sum_y p(y|f) \log p(y|f)\, df$

Mutual Information
Information Gain: the amount of class uncertainty reduction by observing $\Phi$ is
$MI(Y, \Phi) = H(Y) - H(Y|\Phi)$
$MI(Y, \Phi) = \sum_y \int_f p(y, f) \log \frac{p(y, f)}{p(y)\, p(f)}\, df$

Mutual Information
[Figure: Venn diagram with regions $H(X|Y)$, $MI(X,Y)$, $H(Y|X)$ inside $H(X)$, $H(Y)$ and $H(X,Y)$.]
$H(X,Y) = H(X) + H(Y) - MI(X,Y)$
$MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
Possible normalization by $(H(X) + H(Y))/2$.

MI Estimation
[Figure: $P(X)$, $P(Y)$, $P(X) \cdot P(Y)$ and $P(X,Y)$.]
    pxy = [4/16, 6/16; 5/16, 1/16]
    px = sum(pxy)
    py = sum(pxy, 2)
    pxpy = (py*px)
    MI = sum(sum(pxy.*log2(pxy./pxpy)))
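The estimation snippet on this slide can be reproduced in plain Python. The sketch below uses the same joint table `pxy`; the helper `entropy` and the variables `h_x`, `h_y`, `h_xy` are additions made here to check the identity $MI(X,Y) = H(X) + H(Y) - H(X,Y)$, and columns are assumed to index $X$ as in the slide's `sum(pxy)`.

```python
from math import log2

# Joint probability table from the slide: rows index Y, columns index X.
pxy = [[4 / 16, 6 / 16],
       [5 / 16, 1 / 16]]

px = [sum(col) for col in zip(*pxy)]   # column sums, like sum(pxy)
py = [sum(row) for row in pxy]         # row sums, like sum(pxy, 2)

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# MI as the divergence between the joint table and the product of marginals.
mi = sum(pxy[i][j] * log2(pxy[i][j] / (py[i] * px[j]))
         for i in range(2) for j in range(2))

h_x, h_y = entropy(px), entropy(py)
h_xy = entropy([p for row in pxy for p in row])

print(round(mi, 3))                  # 0.138
print(round(h_x + h_y - h_xy, 3))    # 0.138, the same value via entropies
```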

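Tree Classifiers
CART (Breiman, 1984) or C4.5 (Quinlan, 1993).
At each step, choose the feature that reduces entropy most. Work towards node purity.
[Figure: all the data plotted in the $(f_1, f_2)$ plane, with two candidate splits, "Choose $f_1$" and "Choose $f_2$".]

Information Gain
Choose $f_2$: the split sends 11/19 of the data to the left and 8/19 to the right.
$H_{before} = -\frac{11}{19} \log \frac{11}{19} - \frac{8}{19} \log \frac{8}{19} = 0.98$
$H_{left} = -\frac{4}{11} \log \frac{4}{11} - \frac{7}{11} \log \frac{7}{11} = 0.95$
$H_{right} = -\frac{7}{8} \log \frac{7}{8} - \frac{1}{8} \log \frac{1}{8} = 0.54$
$IG = H_{before} - (\frac{11}{19} H_{left} + \frac{8}{19} H_{right}) = 0.2$

IG = MI
$H_{before} = H(Y)$
$H_{left} = H(Y | f_2 = -1)$
$H_{right} = H(Y | f_2 = 1)$
$IG = H_{before} - (p(f_2 = -1) H_{left} + p(f_2 = 1) H_{right})$
$\;\;\; = H(Y) - (p(f_2 = -1) H(Y | f_2 = -1) + p(f_2 = 1) H(Y | f_2 = 1))$
$\;\;\; = H(Y) - H(Y | f_2) = MI(Y, f_2)$

Iris Data (Fisher, 1936/Chapter 1)
[Figure: decision regions for the classes setosa, virginica and versicolor under four classifiers: linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM).]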
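The information gain arithmetic of this slide can be sketched in a few lines. The class counts (4 and 7 in the left branch, 7 and 1 in the right branch) are read off the fractions in the slide; the function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node described by its per-class example counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(children):
    """children: one list of per-class counts for each branch of the split."""
    parent = [sum(per_class) for per_class in zip(*children)]
    n = sum(parent)
    remainder = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - remainder

left, right = [4, 7], [7, 1]   # counts taken from the slide's fractions

print(round(entropy([11, 8]), 2))                  # 0.98
print(round(entropy(right), 2))                    # 0.54
print(round(information_gain([left, right]), 2))   # 0.21
```

A tree learner would evaluate `information_gain` for every candidate split and keep the largest.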

MI for Feature Selection
- Random Forest (Breiman, Cutler, 2002)
- Feature ranking
- Forward selection
- Feature subset selection (Markov Blankets, scaling factors, shrinkage)

Forward Selection with MI
Fleuret. Practical only for binary features.
Select a first feature $X_{\nu(1)}$ with maximum MI with the target.
For each remaining feature $X_i$ and each previously selected feature $X_{\nu(j)}$, compute the conditional mutual information:
$CMI(Y, X_i | X_{\nu(j)}) = \sum_{x_{\nu(j)}} P(x_{\nu(j)})\, MI(Y, X_i | X_{\nu(j)} = x_{\nu(j)})$
Select the feature with maximum CMI.

Transmission Channels
[Figure: input $X$, noisy channel, output $Y$.]
Channel capacity $= \max_{p(x)} MI(X, Y)$ = (second Shannon theorem) highest achievable information transmission rate without error.
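The forward selection procedure above can be sketched as follows, using plug-in probability estimates from counts. One point is an assumption made here: when several features are already selected, a candidate is scored by its smallest CMI over the selected set, which is one way to read "select the feature with maximum CMI". All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(a, b):
    """Plug-in MI (in bits) between two equally long discrete sequences."""
    n = len(a)
    joint, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(c / n * log2(c * n / (pa[u] * pb[v]))
               for (u, v), c in joint.items())

def conditional_mi(y, x, z):
    """CMI(y, x | z) = sum over values v of z of P(z = v) * MI(y, x | z = v)."""
    n = len(z)
    total = 0.0
    for v, count in Counter(z).items():
        rows = [k for k in range(n) if z[k] == v]
        total += count / n * mutual_information([y[k] for k in rows],
                                                [x[k] for k in rows])
    return total

def forward_selection(features, y, n_select):
    """Return the indices of the selected features, in selection order."""
    remaining = list(range(len(features)))
    first = max(remaining, key=lambda i: mutual_information(features[i], y))
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < n_select:
        # Assumption: score a candidate by its smallest CMI over the selected set.
        best = max(remaining, key=lambda i: min(
            conditional_mi(y, features[i], features[j]) for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: the target combines two bits; feature 1 duplicates feature 0.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [2 * ai + bi for ai, bi in zip(a, b)]
print(forward_selection([a, list(a), b], y, 2))   # [0, 2]
```

The duplicate feature has the same MI with the target as the original, but its CMI given the original is zero, so it is skipped in favor of the complementary bit.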

Coding
[Figure: encoder followed by a noisy channel.]
Code efficiency = length(code) / length(original signal)
Theorems (Shannon, 1948):
- Errorless transmission is achievable with sufficiently inefficient (well chosen) codes; the efficiency is bounded by the capacity.
- For a chosen error rate, one can find a shorter code.

MI and Error Rate
Fano, 1961
Hellman and Raviv (1970)
Feder and Merhav (1994)

Noisy Channel in PR
[Figure: Category ("This is a four") -> Encoder -> Ideal pattern ("4") -> Noisy channel -> Real pattern -> Decoder -> Predicted category.]

Noisy Channel in PR
[Figure: Category ("This is a four") -> Encoder -> Ideal pattern ("4") -> Noisy channel -> Real pattern -> Preprocessor -> Features $\Phi$ -> Decoder -> Predicted category.]

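Information Bottleneck
Tishby et al., 1999
[Figure: $X \rightarrow \Phi \rightarrow Y$.]
Information Bottleneck Methods: maximize $MI(Y, \Phi) = MI(\Phi, Y)$ but minimize $MI(X, \Phi)$.

Gradient Descent
Torkkola, 2001
[Figure: class labels $y$ and high-dimensional data $x$; $x$ goes through $f(\theta, x)$ to give the low-dimensional feature vector $f$; the mutual information $MI(f, y)$ is computed and its gradient $\partial MI / \partial \theta$ is fed back to $f(\theta, x)$.]

Computational Tricks (1/4)
[Figure: $x$ (dim $n$) -> feature construction/selection $f(\theta, x)$ -> $f$ (dim $n'$) -> posterior probability estimation -> $p(y|f)$ (dim 1).]
1. Chain rule:
$\frac{\partial MI}{\partial \theta} = \frac{\partial MI}{\partial f} \frac{\partial f}{\partial \theta}$, with dimensions $(1, n_\theta) = (1, n') \, (n', n_\theta)$; $p(y|f)$ appears in $\partial MI / \partial f$.
For example, $f(w, x) = W x^T$ (linear model), with dimensions $(1, n') \, (n', n) \, (n, 1)$, gives $\partial f / \partial w = x$.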

Computational Tricks (2/4)
2. Replace
$MI(Y, \Phi) = \sum_y \int_\phi p(y, f) \log \frac{p(y, f)}{p(y)\, p(f)}\, df$
by the quadratic measure
$MI(Y, \Phi) = \sum_y \int_\phi (p(y, f) - p(y)\, p(f))^2\, df$
Justification: Renyi entropy $H(Y) = -\log \sum_y p(y)^2$

Computational Tricks (3/4)
3. Use densities that are mixtures of Gaussians.

Examples of Suitable Densities
Gaussian mixture (one Gaussian per class):
$p(f|y) = G(f - \bar{f}_y, \Sigma_y)$
$p(f) = \sum_y p(y)\, p(f|y) = \sum_y (m_y / m)\, G(f - \bar{f}_y, \Sigma_y)$
Parzen windows (one Gaussian per example, $\sigma$ the same for all):
$p(f|y) = (1/m_y) \sum_{c \in y} G(f - f_c, \sigma)$
$p(f) = (1/m) \sum_i G(f - f_i, \sigma)$

Parzen Windows (Rosenblatt 1956, Parzen 1962)
$p(x) = (1/m) \sum_{i=1:m} G(f - f_i, \sigma_m)$, with $\sigma_m = \sigma / \sqrt{m}$
From Duda and Hart.
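The Parzen window densities above can be sketched for a one-dimensional feature as follows. This is an illustration under stated assumptions: the feature is scalar, $G(d, \sigma)$ is read as a Gaussian of standard deviation $\sigma$, and the sample values and labels are made up for the example.

```python
from math import exp, pi, sqrt

def gaussian(d, sigma):
    """1-D Gaussian kernel of standard deviation sigma evaluated at offset d."""
    return exp(-0.5 * (d / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def parzen_density(f, samples, sigma):
    """p(f) = (1/m) * sum_i G(f - f_i, sigma): one kernel per example."""
    return sum(gaussian(f - fi, sigma) for fi in samples) / len(samples)

def class_conditional_density(f, samples, labels, y, sigma):
    """p(f | y) = (1/m_y) * sum of G(f - f_c, sigma) over examples c of class y."""
    members = [fi for fi, label in zip(samples, labels) if label == y]
    return parzen_density(f, members, sigma)

# Hypothetical 1-D feature values and class labels.
samples = [0.0, 0.4, 1.0, 3.0, 3.5]
labels = [0, 0, 0, 1, 1]
sigma = 0.5

print(parzen_density(1.0, samples, sigma))
print(class_conditional_density(1.0, samples, labels, 0, sigma))
print(class_conditional_density(1.0, samples, labels, 1, sigma))
```

With these definitions, `parzen_density` equals the prior-weighted sum of the class-conditional densities, which matches $p(f) = \sum_y p(y)\, p(f|y)$.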

Computational Tricks (4/4)
4. Expand the square:
$MI(Y, \Phi) = \sum_y \int_\phi (p(y, f) - p(y)\, p(f))^2\, df$
$MI(Y, \Phi) = \underbrace{\sum_y \int_\phi p(y, f)^2\, df}_{V_{in}} + \underbrace{\sum_y \int_\phi p(y)^2 p(f)^2\, df}_{V_{all}} - \underbrace{2 \sum_y \int_\phi p(y, f)\, p(y)\, p(f)\, df}_{V_{bw}}$

Plugging In
$MI(Y, \Phi) = V_{in} + V_{all} - V_{bw}$
$V_{in}$ = function($G(f_{ci} - f_{cj}, 2\sigma)$)
$V_{all}$ = function($G(f_i - f_j, 2\sigma)$)
$V_{bw}$ = function($G(f_{ci} - f_j, 2\sigma)$)

Information Forces
Maximize MI by hill climbing (gradient ascent on MI, that is, gradient descent on the energy $-MI$):
$\theta(t+1) = \theta(t) + \eta\, \partial MI / \partial \theta$
$\;\;\; = \theta(t) + \eta\, (\partial MI / \partial f)(\partial f / \partial \theta)$
$\;\;\; = \theta(t) + \eta\, [\partial V_{in} / \partial f + \partial V_{all} / \partial f - \partial V_{bw} / \partial f]\, \partial f / \partial \theta$
Force = $-$ gradient(Energy), with Energy $\leftrightarrow -MI$:
$\partial V_{in} / \partial f$: cohesion within classes
$\partial V_{all} / \partial f$: overall cohesion
$-\partial V_{bw} / \partial f$: repulsion to other classes

Feature Construction
Torkkola, 2001
Example 1: Three classes in 3-dimensional space. Artificial data.
1.a f = linear, p(y|f) = Parzen windows
1.b f = RBF, p(y|f) = Parzen windows
Example 2: Three classes in 12-dimensional space. This is the oil pipe-flow data from NCRG at Aston University.
2.a f = linear, p(y|f) = Parzen windows
2.b f = linear, p(y|f) = Gaussian mixture
Example 3: Six classes in 36-dimensional space. This is the Landsat satellite image database from the Statlog project.
f = linear, p(y|f) = Parzen windows
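The decomposition into $V_{in}$, $V_{all}$ and $V_{bw}$ can be evaluated in closed form with Parzen windows. The sketch below makes several assumptions that are not stated on the slide: the feature is one-dimensional, $p(y, f) = (1/m) \sum_{i \in y} G(f - f_i, \sigma)$ and $p(y) = m_y / m$, and the integral of a product of two Gaussian kernels of variance $\sigma^2$ is a Gaussian of variance $2\sigma^2$ in the difference of their centers. The factor 2 is folded into `v_bw`, matching $MI = V_{in} + V_{all} - V_{bw}$.

```python
from math import exp, pi, sqrt

def gauss(d, var):
    """1-D Gaussian density with variance var evaluated at offset d."""
    return exp(-d * d / (2.0 * var)) / sqrt(2.0 * pi * var)

def quadratic_mi(f, y, sigma):
    """Quadratic MI between 1-D features f and labels y under Parzen windows."""
    m = len(f)
    prior = {c: sum(1 for label in y if label == c) / m for c in set(y)}
    var2 = 2.0 * sigma * sigma   # variance of the product-of-kernels integral
    # Pairwise kernel interactions between all examples.
    k = [[gauss(f[i] - f[j], var2) for j in range(m)] for i in range(m)]
    pairs = [(i, j) for i in range(m) for j in range(m)]
    v_in = sum(k[i][j] for i, j in pairs if y[i] == y[j]) / m ** 2
    v_all = sum(p * p for p in prior.values()) * sum(k[i][j] for i, j in pairs) / m ** 2
    v_bw = 2.0 * sum(prior[y[i]] * k[i][j] for i, j in pairs) / m ** 2
    return v_in + v_all - v_bw

# Hypothetical data: two well-separated classes, then two overlapping ones.
separated = quadratic_mi([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], [0, 0, 0, 1, 1, 1], 0.5)
overlapping = quadratic_mi([0.0, 0.0, 1.0, 1.0], [0, 1, 0, 1], 0.5)
print(separated, overlapping)
```

The measure is large when the classes occupy different regions of feature space and vanishes when every class has the same feature distribution, which is the behavior the hill climbing step exploits.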

Application to Feature Selection
$\Phi_i = \sigma_i x_i$, with scaling factors $\sigma_i \ge 0$
$\sigma^* = \arg\max_\sigma MI(\Phi, Y)$ subject to $\sum_i \sigma_i^2 = 1$ (shear)
$\sigma^* = \arg\min_\sigma MI(X, \Phi) - \beta\, MI(\Phi, Y)$

Conclusion
MI characterizes the dependency between random variables:
- Feature ranking
- Forward selection with CMI
MI is also an Information Gain:
- Tree classifiers
- Random Forests
MI characterizes the channel capacity and the rate of distortion:
- Information bottleneck methods: $\min_\theta MI(X, \Phi) - \beta\, MI(\Phi, Y)$

Exercise Class
- Madelon (continued)
- How to review a paper?
- Exercises on IT methods

Madelon
    my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
    my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif})
    f_max=20'
Earn 1 point with a result better than the baseline method (testber<7.33% or [testber=7.33% and feat_num<20 (4%)]).
Earn a total of 2 points with a result testber<6.79% or [testber=6.79% and feat_num<17 (3.4%)].

Homework
Write a review of a paper in the Feature Extraction book (the paper you presented or will present, or any paper/chapter you want).
Questions:
- What is the goal of a review?
- What is the goal of publishing?

Review Form
A. Eliminating criteria (a zero means rejection)
   1. Scope
   2. Novelty
   3. Usefulness
   4. Sanity
B. Other fundamental criteria (a low grade implies major revisions)
   1. Quantity
   2. Reproducibility
   3. Demonstration
   4. Comparison

Review Form (continued)
C. Forgivable problems of contents (minor revisions may be required)
   1. Completeness
   2. Take-aways
   3. Bibliography
   4. Outlook
D. Bonus reproducibility criteria (give additional credit)
   1. Data availability
   2. Code availability

Review Form (end)
E. Message delivery (major/minor revisions may be required)
   1. Readability
   2. Notations
   3. Figures
   4. Formalism
   5. Density
   6. Language

MI and Pearson Correlation
[Figure: joint Gaussian distribution with marginals $P(X)$ and $P(Y)$.]
$MI(X, Y) = -(1/2) \log(1 - R^2)$

Exercise: prove $MI = -(1/2) \log(1 - R^2)$
Entropy of a Gaussian source: $H(X) = \frac{1}{2} \log(2 \pi e\, \sigma^2)$
Solution of one-variable least-square regression $y = ax$: $a = \sigma_{xy} / \sigma_x^2$. We will call $a'$ the solution of $x = a'y$.
Pearson correlation coefficient: $R^2 = a a'$
Residual error: $\sigma^2 = E(y - ax)^2 = (1 - R^2)\, \sigma_y^2$
$Y = \mathrm{function}(X) + N$, where $N$ is random noise: $MI(X, Y) = H(Y) - H(N)$
For a Gaussian channel with Gaussian source: $MI(X, Y) = -\frac{1}{2} \log(\sigma^2 / \sigma_y^2)$. Conclude the proof.

Markov Blankets
Notations: all features $X$, features selected $\Phi$, complement $\bar{\Phi}$, $X = [\Phi, \bar{\Phi}]$.
Definition: a Markov Blanket (MB) is a subset of features $\Phi$ such that
$P(Y, \bar{\Phi} | \Phi) = P(Y | \Phi)\, P(\bar{\Phi} | \Phi)$
We write $Y \perp \bar{\Phi} \,|\, \Phi$ ($Y$ is independent of $\bar{\Phi}$ given $\Phi$).

Exercise (Markov Blankets)
If $\Phi$ is a MB, prove $P(Y | \Phi) = P(Y | X)$ for all assignments of values to $\bar{\Phi}$.
We define $DMI(\Phi) = E_{p(X)}\, D_{KL}(P(Y|X) \,\|\, P(Y|\Phi))$.
Prove: $DMI(\Phi) = E_{p(X,Y)} \log [P(Y|X) / P(Y|\Phi)]$
We define $A = \log[P(X,Y) / (P(X) P(Y))]$ and $B = \log[P(\Phi,Y) / (P(\Phi) P(Y))]$. Prove:
$A - B = \log [P(Y|X) / P(Y|\Phi)]$
$\sum_{\{\Phi\}} P(\Phi, Y)\, B = \sum_{\{X\}} P(X, Y)\, B$
$DMI(\Phi) = MI(X, Y) - MI(\Phi, Y)$
Define an optimization problem to find a minimum MB. How can you relate it to the information bottleneck approach?
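The regression identities used in the Pearson correlation exercise can be checked numerically before attempting the proof. This is a sketch under stated assumptions: the data are synthetic, both variables are centered so that the regressions $y = ax$ and $x = a'y$ have no intercept, and the last line applies the Gaussian formula of the slide to the sample value of $R^2$. The slope 0.8, the noise level 0.6, the sample size and the seed are arbitrary choices.

```python
import random
from math import log2

random.seed(0)
# Synthetic Gaussian channel: y = 0.8 x + noise (arbitrary illustrative values).
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
y = [0.8 * xi + random.gauss(0.0, 0.6) for xi in x]

def centered(v):
    """Subtract the sample mean so that no intercept is needed."""
    mean = sum(v) / len(v)
    return [vi - mean for vi in v]

x, y = centered(x), centered(y)
m = len(x)
s_xx = sum(xi * xi for xi in x) / m
s_yy = sum(yi * yi for yi in y) / m
s_xy = sum(xi * yi for xi, yi in zip(x, y)) / m

a = s_xy / s_xx          # least-squares slope of y = a x
a_prime = s_xy / s_yy    # least-squares slope of x = a' y
r2 = a * a_prime         # squared Pearson correlation
residual = sum((yi - a * xi) ** 2 for xi, yi in zip(x, y)) / m

print(residual, (1.0 - r2) * s_yy)   # the two values agree
mi_bits = -0.5 * log2(1.0 - r2)      # Gaussian-case MI, in bits
print(mi_bits)
```

The agreement between the residual variance and $(1 - R^2)\, \sigma_y^2$ holds for any centered sample; only the final MI value relies on the Gaussian assumption.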

Complement of exercise
- What are the similarities and differences between the bottleneck IT methods and the Bayesian/regularized risk methods? Can you suggest hybrid approaches?
- Can you suggest forward selection and backward elimination IT methods using the scaling factor approach?
