Measures of Diversity in Combining Classifiers


1 Measures of Diversity in Combining Classifiers. Part 2: Non-pairwise diversity measures. For fewer cartoons and more formulas:

2 Random forest: an ensemble of classifiers D_k(x) = D(x, θ_k), where the θ_k are i.i.d., k = 1, …, L, and L is large. Strength and correlation: D(x) is the class label of x suggested by D. Define the margin function for a random forest to be
mr(x, ω_i) = P_θ(D(x) = ω_i) − max_{t ≠ i} P_θ(D(x) = ω_t),
and the strength of the set of classifiers to be s = E_{x,ω}[mr(x, ω)]. Denote ω_s = argmax_{t ≠ i} P_θ(D(x) = ω_t) and define the raw margin function to be
rmr(x, ω_i, θ) = I(D(x) = ω_i) − I(D(x) = ω_s),
where I(·) is an indicator function.

3 The probability of error of the ensemble is bounded as follows:
PE* ≤ ρ̄ (1 − s²) / s²,
where ρ̄ is the (mean) correlation between the raw margins rmr of D_i and D_k, averaged across all pairs of classifiers, and s is the strength. Although the bound is likely to be loose, it fulfils the same suggestive function for random forests as VC-type bounds do for other types of classifiers.

4 The 2-class case: mr(x, ω_i) = 2 P_θ(D(x) = ω_i) − 1, i = 1, 2. The strength of the set of classifiers is
s = E_{x,ω}[mr(x, ω)] ≈ (1/N) [ Σ_{x with true label ω_1} (2 P̂_θ(D(x) = ω_1) − 1) + Σ_{x with true label ω_2} (2 P̂_θ(D(x) = ω_2) − 1) ].
The correlation ρ̄ can be calculated as the averaged pairwise correlation between the oracle outputs. NB: both are just estimates!
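
To make the estimation concrete, here is a minimal NumPy sketch (not from the slides; `strength_correlation_bound` is a hypothetical helper name) that computes s, ρ̄ and the bound from an N×L matrix of oracle outputs in the 2-class case:

```python
import numpy as np

def strength_correlation_bound(oracle):
    """Estimate Breiman's strength s, mean pairwise correlation rho_bar,
    and the error bound rho_bar * (1 - s**2) / s**2 for the 2-class case.

    oracle : (N, L) binary array; oracle[j, k] = 1 if classifier k
             labels data point j correctly, 0 otherwise.
    """
    oracle = np.asarray(oracle, dtype=float)
    N, L = oracle.shape

    # 2-class raw margins: +1 for a correct vote, -1 for an incorrect one
    rmr = 2.0 * oracle - 1.0

    # strength = average margin, where the margin at x is the mean raw margin
    s = rmr.mean()

    # mean pairwise correlation between the classifiers' raw-margin columns
    corr = np.corrcoef(rmr, rowvar=False)      # (L, L) correlation matrix
    iu = np.triu_indices(L, k=1)
    rho_bar = corr[iu].mean()

    bound = rho_bar * (1.0 - s**2) / s**2 if s > 0 else np.inf
    return s, rho_bar, bound
```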

5 An example: banana-shaped data (the gendatb routine from the Matlab toolbox PRtools). Training: N = 600 data points. Testing (a separate set): N = 600 data points. The idea was to avoid using out-of-bag (OOB) estimates, which in any case simulate estimates on an independent testing set of the same size.

6 Simple bagging, L = 50 classifiers. [Plot: error bound and testing error]

7 [Plots: Q, strength, and correlation]

8 [Plots: L = 50, N = 600 vs. L = 50, N = 00]

9 Is strength related to accuracy? [Worked example table: true and guessed labels for classifiers D_1, D_2, D_3, the estimates P_θ(D(x) = ω_1) and P_θ(D(x) = ω_2), and the resulting accuracy and strength]

10 [Plots: individual testing errors, averaged individual testing error, and strength]

11 Part 2: Non-pairwise diversity measures
0. A note on pairwise diversity (ρ) for random forests
Measures based on a single data point + averaging (entropy, spread, KW variance)
Interrater agreement (kappa for multiple raters)
Measures based on difficulties of the data points
Relationship with accuracy
Open problems

12 Now we look at the whole ensemble of classifiers. Classifier outputs come in several types: oracle (binary: correct/incorrect), continuous-valued (measurement level), class labels (abstract level), and ordered lists of class labels. [Example outputs for each type omitted.] The table of continuous-valued outputs (the support given by each classifier to each class) is called a decision profile (remember for later).

13 1. Measures based on a single data point (case, instance, example, object, whatever) and subsequently averaged over the whole data set. 2. Measures based on all data points. For oracle outputs and L = 8 classifiers, are these output patterns diverse? [Example patterns omitted.] No-o-o-o-o-o-o! Nope. Yes. Yes.

14 ENTROPY (oracle outputs). How do we measure how far we are from the desired pattern of L/2 zeros and L/2 ones for the N objects? One option:
E = (1/N) Σ_k min{ Σ 0's, Σ 1's }_k / (L − ⌈L/2⌉).
Alternatively, consider the output 0 or 1 as a random variable with relative frequencies p_0 = (Σ 0's)/L and p_1 = (Σ 1's)/L, respectively. Then the (proper) formula for the entropy of the distribution, averaged across the N data points, is
H = −(1/N) Σ_k [ p_0 log p_0 + p_1 log p_1 ]_k [Cunningham & Carney, 2000]
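
A small sketch (hypothetical helper name, assuming NumPy, log base 2) of how both quantities could be computed from an N×L oracle-output matrix:

```python
import numpy as np

def entropy_measures(oracle):
    """Non-pairwise entropy measures for an (N, L) oracle-output matrix.

    E : entropy measure based on min{#0's, #1's} per data point,
        0 (no diversity) .. 1 (highest diversity).
    H : averaged Shannon entropy of the 0/1 vote distribution per point.
    """
    oracle = np.asarray(oracle, dtype=float)
    N, L = oracle.shape
    ones = oracle.sum(axis=1)            # number of correct votes per point
    zeros = L - ones

    # E: how close each point is to the "most diverse" split of L/2 vs L/2
    E = np.mean(np.minimum(ones, zeros)) / (L - np.ceil(L / 2.0))

    # H: Shannon entropy of (p0, p1) per point, averaged over the data set
    p1 = ones / L
    p0 = zeros / L
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(np.where(p0 > 0, p0 * np.log2(p0), 0.0) +
              np.where(p1 > 0, p1 * np.log2(p1), 0.0))
    H = h.mean()
    return E, H
```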

15 ENTROPY (label outputs). Votes of L = 10 classifiers for a single x*, distributed over the classes ω_1, ω_2, ω_3, ω_4 [vote histogram omitted]. With p_i the proportion of votes for ω_i,
H = −(1/N) Σ_k [ Σ_i p_i log p_i ]_k
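
And the corresponding sketch for label outputs (hypothetical helper name, assuming NumPy, class labels coded 0 … c−1):

```python
import numpy as np

def label_entropy(votes, n_classes):
    """Averaged entropy of the label-vote distribution,
    for an (N, L) matrix of class labels in {0, ..., n_classes-1}."""
    votes = np.asarray(votes)
    N, L = votes.shape
    H = 0.0
    for k in range(N):
        counts = np.bincount(votes[k], minlength=n_classes)
        p = counts / L                       # vote proportions p_i
        p = p[p > 0]
        H += -(p * np.log2(p)).sum()
    return H / N
```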

16 Breiman's bias-variance decomposition, 1996. Assume that the classifier output for a given x* is a random variable with p.m.f. P(ω_1 | x*, D), …, P(ω_c | x*, D). Let ω_B be the Bayes-optimal label for x* and ω_s the most probable guessed label. The classification error is
P(error | x*) = 1 − Σ_j P(ω_j | x*) P(ω_j | x*, D)
= [1 − P(ω_B | x*)] + { P(ω_B | x*) − Σ_j P(ω_j | x*) P(ω_j | x*, D) }
= [1 − P(ω_B | x*)] + Σ_j [P(ω_B | x*) − P(ω_j | x*)] P(ω_j | x*, D)   (using Σ_j P(ω_j | x*, D) = 1)
= P_B(x*) + [P(ω_B | x*) − P(ω_s | x*)] P(ω_s | x*, D)   (bias)
  + Σ_{j ≠ s} [P(ω_B | x*) − P(ω_j | x*)] P(ω_j | x*, D)   (spread)

17 P(error | x*) = P_B(x*) + [P(ω_B | x*) − P(ω_s | x*)] P(ω_s | x*, D) (bias) + Σ_{j ≠ s} [P(ω_B | x*) − P(ω_j | x*)] P(ω_j | x*, D) (spread). Is the spread related to diversity? An example: suppose we drew a classifier at random from the distribution P(· | x*, D), with
guessed: P(ω_1 | x*, D) = 0.2, P(ω_2 | x*, D) = 0.1, P(ω_3 | x*, D) = 0.4, P(ω_4 | x*, D) = 0.3
true: P(ω_1 | x*) = 0.3, P(ω_2 | x*) = 0.1, P(ω_3 | x*) = 0.2, P(ω_4 | x*) = 0.4
Then ω_B = ω_4, ω_s = ω_3, and
P(error | x) = 0.6 + [0.4 − 0.2] × 0.4 + [0.1 × 0.2 + 0.3 × 0.1] = 0.73

18 P(error | x) = 0.6 + [0.4 − 0.2] × 0.4 + [0.1 × 0.2 + 0.3 × 0.1] = 0.73. Now take the majority vote instead. This means deciding always ω_s for x*, i.e., P(ω_s | x*, D) = 1:
guessed: P(ω_3 | x*, D) = 1, all other classes 0
true: P(ω_1 | x*) = 0.3, P(ω_2 | x*) = 0.1, P(ω_3 | x*) = 0.2, P(ω_4 | x*) = 0.4
P(error | x) = 0.6 + [0.4 − 0.2] × 1.0 = 0.8
So in this example the majority vote does worse than a classifier drawn at random from P(· | x*, D).

19 KW variance (label outputs) [Kohavi & Wolpert, 1996, "Bias plus variance decomposition for zero-one loss functions"]. The c-class case:
P(error | x) = bias²(x) + variance(x) + noise(x)
bias²(x) = ½ Σ_ω (P_true(ω | x) − P_guessed(ω | x))²
variance(x) = ½ (1 − Σ_ω P_guessed(ω | x)²)
noise(x) = ½ (1 − Σ_ω P_true(ω | x)²)

20 For the same example (guessed: 0.2, 0.1, 0.4, 0.3; true: 0.3, 0.1, 0.2, 0.4):
bias²(x) = ½ Σ_ω (P_true(ω | x) − P_guessed(ω | x))² = ½ [(0.3 − 0.2)² + (0.1 − 0.1)² + (0.2 − 0.4)² + (0.4 − 0.3)²] = 0.03
variance(x) = ½ (1 − Σ_ω P_guessed(ω | x)²) = ½ [1 − (0.2² + 0.1² + 0.4² + 0.3²)] = 0.35
noise(x) = ½ (1 − Σ_ω P_true(ω | x)²) = ½ [1 − (0.3² + 0.1² + 0.2² + 0.4²)] = 0.35
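
A short numeric check of both decompositions, assuming the distributions reconstructed above (true = (0.3, 0.1, 0.2, 0.4), guessed = (0.2, 0.1, 0.4, 0.3)); note that 0.03 + 0.35 + 0.35 = 0.73, agreeing with the Breiman-style total:

```python
import numpy as np

# Assumed example distributions from the slides above
p_true    = np.array([0.3, 0.1, 0.2, 0.4])   # P(omega_j | x*)
p_guessed = np.array([0.2, 0.1, 0.4, 0.3])   # P(omega_j | x*, D)

# Breiman-style decomposition (slides 16-18)
B = np.argmax(p_true)        # Bayes-optimal label omega_B
s = np.argmax(p_guessed)     # most probable guessed label omega_s
bayes_err = 1.0 - p_true[B]
bias   = (p_true[B] - p_true[s]) * p_guessed[s]
spread = sum((p_true[B] - p_true[j]) * p_guessed[j]
             for j in range(len(p_true)) if j != s)
print(bayes_err + bias + spread)             # 0.73

# Kohavi-Wolpert decomposition (slides 19-20)
bias2    = 0.5 * np.sum((p_true - p_guessed) ** 2)     # 0.03
variance = 0.5 * (1.0 - np.sum(p_guessed ** 2))        # 0.35
noise    = 0.5 * (1.0 - np.sum(p_true ** 2))           # 0.35
print(bias2, variance, noise, bias2 + variance + noise)
```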

21 KW variance (oracle outputs). Consider again the output 0 or 1 as a random variable with relative frequencies p_0 = (Σ 0's)/L and p_1 = (Σ 1's)/L, respectively. Then the variance is
variance(x) = ½ (1 − p_0² − p_1²).
Averaging across the whole data set,
KW = 1/(N L²) Σ_k [ (Σ 0's) (Σ 1's) ]_k.
Curiously, KW and the averaged pairwise disagreement measure D_av are related through KW = (L − 1)/(2L) D_av.
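
A minimal sketch (hypothetical helper name, assuming NumPy) that computes KW and the averaged pairwise disagreement, so the relation KW = (L − 1)/(2L) D_av can be verified numerically:

```python
import numpy as np

def kw_and_disagreement(oracle):
    """KW variance and averaged pairwise disagreement for an (N, L)
    oracle-output matrix; they should satisfy KW = (L-1)/(2L) * D_av."""
    oracle = np.asarray(oracle, dtype=float)
    N, L = oracle.shape
    ones = oracle.sum(axis=1)
    zeros = L - ones

    KW = np.sum(zeros * ones) / (N * L**2)

    # averaged pairwise disagreement: fraction of points on which a pair
    # of classifiers gives different oracle outputs, averaged over pairs
    dis = []
    for i in range(L):
        for k in range(i + 1, L):
            dis.append(np.mean(oracle[:, i] != oracle[:, k]))
    D_av = np.mean(dis)

    return KW, D_av   # check: KW == (L - 1) / (2 * L) * D_av
```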

22 1. Measures based on a single data point, averaged over the whole data set. 2. Measures based on all data points. Interrater agreement, kappa (oracle outputs):
κ = 1 − (1 / (N L (L − 1) p̄ (1 − p̄))) Σ_k [ (Σ 0's) (Σ 1's) ]_k,
where N is the number of data points, L the number of classifiers, and p̄ the averaged individual accuracy (equivalently, κ = 1 − L · KW / ((L − 1) p̄ (1 − p̄))).
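
A corresponding sketch for κ (hypothetical helper name, assuming NumPy):

```python
import numpy as np

def interrater_kappa(oracle):
    """Interrater agreement kappa for an (N, L) oracle-output matrix,
    following the formula on slide 22."""
    oracle = np.asarray(oracle, dtype=float)
    N, L = oracle.shape
    ones = oracle.sum(axis=1)          # correct votes per data point
    zeros = L - ones
    p_bar = oracle.mean()              # averaged individual accuracy
    return 1.0 - np.sum(zeros * ones) / (N * L * (L - 1) * p_bar * (1 - p_bar))
```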

23 Measure of difficulty θ [Hansen & Salamon, 1990]. Define a random variable X = the proportion of classifiers which correctly classify a randomly drawn sample x. Let L = 7. [Histogram of X over the data set: from points misclassified by all 7, through points recognized by 4, to points recognized by all 7]

24 L = 7, individual accuracy p. [Histograms of the distribution of X for independent, identical, and negatively dependent classifiers]

25 Measure of diversity: θ = Var(X). [Histograms of X for independent, identical, and diverse classifiers, with their θ values]
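
The difficulty measure reduces to a one-liner on an oracle-output matrix; a sketch with a hypothetical helper name:

```python
import numpy as np

def difficulty_theta(oracle):
    """Difficulty measure theta = Var(X), where X is the proportion of
    classifiers that correctly classify a randomly drawn data point."""
    oracle = np.asarray(oracle, dtype=float)
    X = oracle.mean(axis=1)    # proportion of correct classifiers per point
    return X.var()             # population variance over the data set
```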

26 Generalized diversity [Partridge & Krzanowski, 1997]. Define a random variable Y = the proportion of classifiers which misclassify a randomly drawn sample x (Y = 1 − X, with X defined before). [Histogram of Y: from points recognized by all 7 to points misclassified by all 7.] Denote by p_i the probability that Y = i/L, and by p(k) the probability that k randomly chosen classifiers will fail on a randomly drawn x.

27 p(1) = Σ_i p_i · i/L (the probability of a single classifier failing)
p(2) = Σ_i p_i · i(i − 1)/(L(L − 1)) (the probability that two randomly chosen classifiers will fail together)
GD = 1 − p(2)/p(1)
Coincident failure diversity:
CFD = 0, if p_0 = 1;
CFD = 1/(1 − p_0) Σ_{i=1}^{L} p_i (L − i)/(L − 1), if p_0 < 1.
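
A sketch (hypothetical helper name, assuming NumPy) of p(1), p(2), GD and CFD computed from an oracle-output matrix:

```python
import numpy as np

def gd_cfd(oracle):
    """Generalized diversity (GD) and coincident failure diversity (CFD)
    from an (N, L) oracle-output matrix (sketch of slide 27's formulas;
    assumes at least one classifier fails somewhere, so p(1) > 0)."""
    oracle = np.asarray(oracle, dtype=float)
    N, L = oracle.shape
    fails = (L - oracle.sum(axis=1)).astype(int)   # failing classifiers per point

    # p_i = probability that exactly i of the L classifiers fail
    p = np.bincount(fails, minlength=L + 1) / N
    i = np.arange(L + 1)

    p1 = np.sum(p * i / L)                         # one random classifier fails
    p2 = np.sum(p * i * (i - 1) / (L * (L - 1)))   # two random classifiers both fail
    GD = 1.0 - p2 / p1

    if p[0] == 1.0:
        CFD = 0.0
    else:
        CFD = np.sum(p[1:] * (L - i[1:]) / (L - 1)) / (1.0 - p[0])
    return GD, CFD
```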

28 Relationship between diversity and accuracy. Correlations between the improvement over the single best classifier and some diversity measures (WBC data). [Table omitted: columns are the diversity measures Q, ρ, Dis, DF, κ, θ, GD, CFD; rows are the combination methods MAJ, NB, BKS, WER, MAX, AVR, PRO, DT]

29 Relationship between diversity measures. [Diagram: pairwise vs. non-pairwise measures; Q, ρ, E, KW, κ and θ form one related group; GD stands apart; DF and CFD are the non-symmetrical measures]

30 Open problems:
How to narrow down the study? (Use a specific methodology for building the ensemble.)
Some theory would not go amiss.
Diversity for label outputs and continuous-valued outputs might lead somewhere. The difficulty comes from the fact that the outputs of the classifiers are vectors. [Sketch: decision profiles of D_1, D_2, D_3 over ω_1, …, ω_4, with pairwise similarity between the distributions]
