Combining Classifiers
Generic methods of generating and combining multiple classifiers:
- Bagging
- Boosting
- Stacking: "meta-learn" which classifier does well where
- Error-correcting codes: going from binary to multi-class problems
References: Duda, Hart & Stork, pp. 475-480; Hastie, Tibshirani, Friedman, pp. 246-256 and Chapter 10; http://www.boosting.org/ (bulletin board: "Is there a book available on boosting?")
Why Combine Classifiers?
Combine several classifiers to produce a more accurate single classifier.
- If C_2 and C_3 are correct where C_1 is wrong, etc., a majority vote will do better than each C_i individually.
- Suppose each C_i has error rate p < 0.5 and the errors of different C_i are uncorrelated.
- Then Pr(r out of n classifiers are wrong) = C(n,r) p^r (1-p)^(n-r).
- Pr(majority of n classifiers are wrong) = right half of the binomial distribution, which is small if n is large and p is small.
[Figure: probability of error vs. r = number of classifiers wrong, r = 1, 2, ..., n]
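The binomial calculation above can be checked directly. A minimal sketch in Python (an illustration, not part of the original slides; the function name and example values are mine):

```python
# Probability that a majority of n independent classifiers, each with error
# rate p, are wrong -- the right half of the binomial distribution above.
from math import comb

def prob_majority_wrong(n: int, p: float) -> float:
    """Pr(more than n/2 of n independent classifiers are wrong)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r)
               for r in range(n // 2 + 1, n + 1))

if __name__ == "__main__":
    print(prob_majority_wrong(1, 0.3))   # a single classifier errs 30% of the time
    print(prob_majority_wrong(21, 0.3))  # a majority of 21 errs only a few % of the time
```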
Bagging "Bootstrap aggregation" Bootstrap estiation - generate data set by randoly selecting fro training set with replaceent (soe points ay repeat) repeat B ties use as estiate the average of individual estiates Bagging generate B equal size training sets each training set is drawn randoly, with replaceent, fro the data is used to generate a different coponent classifier f i usually using sae algorith (e.g. decision tree) final classifier decides by voting aong coponent classifiers Leo Breian, 1996.
Bagging (contd)
Suppose there are k classes. Each f_i(x) predicts 1 of the classes; equivalently, f_i(x) = (0, 0, ..., 0, 1, 0, ..., 0).
Define f_bag(x) = (1/B) sum_{i=1}^{B} f_i(x) = (p_1(x), ..., p_k(x)), where p_j(x) = proportion of the f_i predicting class j at x.
The bagged prediction is argmax_j f_bag(x).
Reduces variance:
- always (provably) for squared error
- not always for classification (0/1 loss)
In practice, bagging is usually most effective if the classifiers are "unstable", i.e. depend sensitively on the training points.
However, it may lose interpretability: a bagged decision tree is not a single decision tree.
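A minimal sketch of the bagging procedure just described (an illustration, not the slides' own code; it assumes scikit-learn decision trees as the unstable component classifier and integer class labels 0, ..., k-1):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=50, random_state=0):
    """Train B trees, each on a bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    classifiers = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap sample: points may repeat
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    """Vote: pick the class predicted by the most component classifiers."""
    votes = np.stack([clf.predict(X) for clf in classifiers])   # shape (B, n_samples)
    # assumes integer labels 0..k-1, so a bincount per column gives the vote tally
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```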
Boosting
Generate the component classifiers so that each does well where the previous ones do badly.
- Train classifier C_1 using (some part of) the training data.
- Train classifier C_2 so that it performs well on points where C_1 performs badly.
- Train classifier C_3 to perform well on data classified badly by C_1 and C_2, etc.
- The overall classifier C classifies by weighted voting among the component classifiers C_i.
The same algorithm is used to generate each C_i - only the data used for training changes.
AdaBoost "Adaptive Boosting" Give each training point (x i, y i =!1) ) in D a weight w i (initialized uniforly) Repeat: Draw a training set D at rando fro D according to the weights w i Generate classifier C using training set D Measure error of C on D Increase weights of isclassified training points Decrease weights of correctly classified points Overall classification is deterined by C boost (x) = Sign( C (x)), where easures the "quality" of C Terinate when C boost (x) has low error
AdaBoost (Details)
Initialize weights uniformly: w_i^1 = 1/N (N = training set size).
Repeat for m = 1, 2, ..., M:
- Draw a random training set D_m from D according to the weights w_i^m.
- Train classifier C_m using training set D_m.
- Compute err_m = Pr_{i~D}[C_m(x_i) != y_i], the error rate of C_m on the (weighted) training points.
- Compute alpha_m = 0.5 log((1-err_m)/err_m); alpha_m = 0 when err_m = 0.5, and alpha_m -> infinity as err_m -> 0.
- Set w_i^* = w_i^m exp(alpha_m)  = w_i^m sqrt((1-err_m)/err_m)  if x_i is incorrectly classified,
      w_i^* = w_i^m exp(-alpha_m) = w_i^m sqrt(err_m/(1-err_m))  if x_i is correctly classified.
- Set w_i^{m+1} = w_i^*/Z_m, where Z_m = sum_i w_i^* is a normalization factor so that sum_i w_i^{m+1} = 1.
The overall classification is determined by C_boost(x) = Sign(sum_m alpha_m C_m(x)).
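A minimal sketch of this loop (an illustration, not the slides' own code; it assumes labels y in {-1, +1} and uses scikit-learn decision stumps as the component classifiers C_m):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50, random_state=0):
    rng = np.random.default_rng(random_state)
    N = len(X)
    w = np.full(N, 1.0 / N)                          # w_i^1 = 1/N
    classifiers, alphas = [], []
    for _ in range(M):
        idx = rng.choice(N, size=N, p=w)             # draw D_m according to the weights
        C_m = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = C_m.predict(X) != y
        err = np.sum(w[miss])                        # weighted error on D
        if err == 0 or err >= 0.5:                   # no longer a useful weak learner
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(np.where(miss, alpha, -alpha))  # up-weight mistakes, down-weight correct
        w /= w.sum()                                   # normalize by Z_m
        classifiers.append(C_m)
        alphas.append(alpha)
    return classifiers, np.array(alphas)

def adaboost_predict(classifiers, alphas, X):
    """C_boost(x) = Sign(sum_m alpha_m C_m(x))."""
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(scores)
```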
Theory
If each component classifier C_m is a "weak learner" - performs better than random chance (err_m < 0.5) -
then the TRAINING SET ERROR of C_boost can be made arbitrarily small as M (the number of boosting rounds) -> infinity. Proof: see later.
Probabilistic bounds on the TEST SET ERROR can be obtained as a function of the training set error, sample size, number of boosting rounds, and "complexity" of the classifiers C_m.
If the Bayes Risk is high, it may become impossible to continually find C_m which perform better than chance.
"In theory, theory and practice are the same, but in practice they are different."
Practice
- Use an independent test set to determine the stopping point.
- Boosting performs very well in practice, and is fast: boosting decision "stumps" is competitive with decision trees.
- The test set error may continue to fall even after the training set error reaches 0 - boosting does not (usually) overfit.
- Sometimes vulnerable to outliers/noise.
- The result may be difficult to interpret.
"AdaBoost with trees is the best off-the-shelf classifier in the world" - Breiman, 1996.
[Figure: test set error and training set error vs. number of boosting rounds]
History
- Robert Schapire, 1989: a weak classifier could be boosted.
- Yoav Freund, 1995: boost by combining many weak classifiers; required a bound on the error rate of the weak classifier.
- Freund & Schapire, 1996: AdaBoost - adapts the weights based on the error rate of the weak classifier.
Many extensions since then:
- boosting decision trees, Naive Bayes, ...
- more robust to noise
- improving interpretability of the boosted classifier
- incorporating prior knowledge
- extending to the multi-class case
- "Balancing between Boosting and Bagging using Bumping"
- ...
Proof
Claim: if err_m < 0.5 for all m, then the Training Set Error of C_boost -> 0 as M -> infinity.
Note: y_i C_m(x_i) = 1 if x_i is correctly classified by C_m, = -1 if x_i is incorrectly classified by C_m; similarly for C_boost(x) = sign(sum_m alpha_m C_m(x)).
The Training Set Error of the classifier C_boost(x) is err_boost = |{i : C_boost(x_i) != y_i}| / N.
C_boost(x_i) != y_i  if and only if  y_i sum_m alpha_m C_m(x_i) < 0  if and only if  -y_i sum_m alpha_m C_m(x_i) > 0.
Hence C_boost(x_i) != y_i implies exp(-y_i sum_m alpha_m C_m(x_i)) > 1, so err_boost <= [sum_i exp(-y_i sum_m alpha_m C_m(x_i))]/N.
By definition, w_i^{m+1} = w_i^m exp(-alpha_m y_i C_m(x_i)) / Z_m, so exp(-alpha_m y_i C_m(x_i)) = Z_m w_i^{m+1}/w_i^m.
Now insert the sum into the exponential:
exp(-y_i sum_m alpha_m C_m(x_i)) = prod_m exp(-alpha_m y_i C_m(x_i)) = prod_m Z_m w_i^{m+1}/w_i^m = (w_i^{M+1}/w_i^1) prod_m Z_m = N w_i^{M+1} prod_m Z_m.
Proof (continued)
Thus [sum_i exp(-y_i sum_m alpha_m C_m(x_i))]/N = sum_i w_i^{M+1} prod_m Z_m = prod_m Z_m, because sum_i w_i^{M+1} = 1 (having been normalized by Z_M).
Nothing has been said so far about the choice of alpha_m. Set alpha_m = 0.5 log((1-err_m)/err_m).
Then w_i^* = w_i^m sqrt((1-err_m)/err_m) if x_i is incorrectly classified, w_i^m sqrt(err_m/(1-err_m)) if x_i is correctly classified.
To normalize, set Z_m = sum_i w_i^* = err_m sqrt((1-err_m)/err_m) + (1-err_m) sqrt(err_m/(1-err_m))
    = sqrt(err_m(1-err_m)) + sqrt(err_m(1-err_m)) = 2 sqrt(err_m(1-err_m)),    because sum_i w_i^m = 1.
So err_boost <= [sum_i exp(-y_i sum_m alpha_m C_m(x_i))]/N = prod_m Z_m = prod_m 2 sqrt(err_m(1-err_m)).
NOTE: D, H & S, pg 479, says err_boost = prod_m 2 sqrt(err_m(1-err_m)).
Proof (continued)
Let gamma_m = 0.5 - err_m > 0 for all m; gamma_m is the "edge" of C_m over random guessing.
Then 2 sqrt(err_m(1-err_m)) = 2 sqrt((0.5-gamma_m)(0.5+gamma_m)) = sqrt(1 - 4 gamma_m^2).
So err_boost <= prod_m sqrt(1 - 4 gamma_m^2)
             <= prod_m (1 - 2 gamma_m^2)     since (1-x)^0.5 = 1 - 0.5x - ...
             <= prod_m exp(-2 gamma_m^2)     since 1 + x <= exp(x)
              = exp(-2 sum_m gamma_m^2).
If gamma_m > gamma > 0 for all m, then err_boost <= exp(-2 sum_m gamma^2) = exp(-2 M gamma^2), which tends to zero exponentially fast as M -> infinity.
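A small numerical check of the final bound (illustration only; the value gamma = 0.1 is an assumption, not from the slides):

```python
# Bound on the training error of C_boost when every round has edge gamma:
# err_boost <= exp(-2 * M * gamma**2).
from math import exp

gamma = 0.1                      # e.g. every weak learner has err_m = 0.4
for M in (10, 50, 100, 200):
    print(M, exp(-2 * M * gamma**2))
# Already at M = 200 the bound is exp(-4), about 0.018, and it keeps shrinking exponentially.
```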
Why Boosting Works
"The success of boosting is really not very mysterious." - Jerome Friedman, 2000.
Additive models: f(x) = sum_m beta_m b(x; gamma_m). Classify using Sign(f(x)).
- b = "basis" function parametrized by gamma_m; the beta_m are weights.
Examples:
- neural networks: b = activation function, gamma = input-to-hidden weights
- support vector machines: b = kernel function, appropriately parametrized
- boosting: b = weak classifier, appropriately parametrized
Fitting Additive Models
To fit f(x) = sum_m beta_m b(x; gamma_m), the beta_m, gamma_m are usually found by minimizing a loss function (e.g. squared error) over the training set.
Forward Stagewise fitting: add new basis functions to the expansion one by one, and do not modify previous terms.
Algorithm:
- f_0(x) = 0
- For m = 1 to M:
  - Find beta_m, gamma_m by min_{beta,gamma} sum_i L(y_i, f_{m-1}(x_i) + beta b(x_i; gamma))
  - Set f_m(x) = f_{m-1}(x) + beta_m b(x; gamma_m)
AdaBoost is Forward Stagewise fitting applied to the weak classifier with an EXPONENTIAL loss function.
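A minimal sketch of forward stagewise fitting under squared-error loss (an illustration, not the slides' code; it assumes scikit-learn regression stumps as the basis functions b(x; gamma), with the stump's fitted values absorbing beta_m):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise_fit(X, y, M=100):
    terms = []                        # the fitted basis functions, added one by one
    f = np.zeros(len(y))              # f_0(x) = 0
    for _ in range(M):
        residual = y - f              # squared-error loss: the new term is fit to the residual
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        f = f + stump.predict(X)      # previous terms are never modified
        terms.append(stump)
    return terms

def forward_stagewise_predict(terms, X):
    return sum(b.predict(X) for b in terms)
```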
AdaBoost (Derivation)
L(y, f(x)) = exp(-y f(x))    exponential loss
beta_m, C_m = argmin_{beta,C} sum_i exp(-y_i (f_{m-1}(x_i) + beta C(x_i)))
            = argmin_{beta,C} sum_i exp(-y_i f_{m-1}(x_i)) exp(-beta y_i C(x_i))
            = argmin_{beta,C} sum_i w_i^m exp(-beta y_i C(x_i)),   where w_i^m = exp(-y_i f_{m-1}(x_i))
w_i^m depends on neither beta nor C.
Note: sum_i w_i^m exp(-beta y_i C(x_i)) = e^{-beta} sum_{y_i = C(x_i)} w_i^m + e^{beta} sum_{y_i != C(x_i)} w_i^m
    = e^{-beta} sum_i w_i^m + (e^{beta} - e^{-beta}) sum_i w_i^m Ind(y_i != C(x_i))
For beta > 0, pick C_m = argmin_C sum_i w_i^m Ind(y_i != C(x_i)) = argmin_C err_m.
AdaBoost (Derivation) (continued)
Substituting C_m back yields e^{-beta} sum_i w_i^m + (e^{beta} - e^{-beta}) err_m, a function of beta only.
argmin_beta of e^{-beta} sum_i w_i^m + (e^{beta} - e^{-beta}) err_m can be found by differentiating, etc. - Exercise!
giving beta_m = 0.5 log((1-err_m)/err_m), the same alpha_m as before.
The model update is: f_m(x) = f_{m-1}(x) + beta_m C_m(x).
w_i^{m+1} = exp(-y_i f_m(x_i)) = exp(-y_i (f_{m-1}(x_i) + beta_m C_m(x_i)))
          = exp(-y_i f_{m-1}(x_i)) exp(-beta_m y_i C_m(x_i))
          = w_i^m exp(-beta_m y_i C_m(x_i)),
deriving the weight update rule.
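Filling in the "Exercise" step, assuming the weights have been normalized so that sum_i w_i^m = 1:

```latex
\begin{align*}
J(\beta) &= e^{-\beta}\,(1-\mathrm{err}_m) + e^{\beta}\,\mathrm{err}_m \\
\frac{dJ}{d\beta} &= -e^{-\beta}\,(1-\mathrm{err}_m) + e^{\beta}\,\mathrm{err}_m = 0 \\
\Rightarrow\; e^{2\beta} &= \frac{1-\mathrm{err}_m}{\mathrm{err}_m}
\;\Rightarrow\; \beta_m = \tfrac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}.
\end{align*}
```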
Exponential Loss
L_1(y, f(x)) = exp(-y f(x))       exponential loss
L_2(y, f(x)) = Ind(y f(x) < 0)    0/1 loss
L_3(y, f(x)) = (y - f(x))^2       squared error
[Figure: L_1, L_2, L_3 plotted against y f(x), the unnormalized margin]
Exponential loss puts heavy weight on examples with a large negative margin.
These are difficult, atypical training points - boosting is sensitive to outliers.
Boosting and SVMs
The margin of (x_i, y_i) is (y_i sum_m alpha_m C_m(x_i)) / (sum_m alpha_m) = y_i (alpha . C(x_i)) / ||alpha||_1
- it lies between -1 and 1
- it is > 0 if and only if x_i is classified correctly
Large margins on the training set yield better bounds on the generalization error.
It can be argued that boosting attempts to (approximately) maximize the minimum margin:
    max_alpha min_i y_i (alpha . C(x_i)) / ||alpha||_1
the same expression as for SVMs, but with the 1-norm instead of the 2-norm.
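For comparison, the two max-min-margin problems side by side (a sketch of the standard formulations, with Phi denoting the SVM feature map; this comparison is implied by, not spelled out in, the slide):

```latex
% Boosting (1-norm) vs. SVM (2-norm) maximum-minimum-margin problems
\max_{\alpha}\;\min_i\;\frac{y_i\,(\alpha\cdot C(x_i))}{\lVert\alpha\rVert_1}
\qquad\text{vs.}\qquad
\max_{w}\;\min_i\;\frac{y_i\,(w\cdot\Phi(x_i))}{\lVert w\rVert_2}
```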
Stacking
Stacking = "stacked generalization".
Usually used to combine models l_1, ..., l_r of different types, e.g. l_1 = neural network, l_2 = decision tree, l_3 = Naive Bayes, ...
Use a "meta-learner" L to learn which classifier is best where.
Let x be an instance for the component learners. A training instance for L is of the form
- (l_1(x), ..., l_r(x)), where l_i(x) = class predicted by classifier l_i, OR
- (l_11(x), ..., l_1k(x), ..., l_r1(x), ..., l_rk(x)), where l_ij(x) = probability that x is in class j according to classifier l_i.
Stacking (continued)
What should the class label for L be?
- the actual label from the data: may prefer classifiers that overfit
- use a "hold-out" data set which is not used to train l_1, ..., l_r: wastes data
- use cross-validation: when x occurs in the held-out fold, use it as a training instance for L; computationally expensive
Use simple linear models for L.
David Wolpert, 1992.
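A minimal sketch of stacking with the cross-validation scheme above (an illustration, not the slides' code; the component learners and the linear meta-learner are scikit-learn models chosen here for concreteness):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y, cv=5):
    components = [DecisionTreeClassifier(), GaussianNB()]
    # Meta-features: out-of-fold class-probability predictions, so the meta-learner
    # never sees a component's prediction on its own training fold.
    meta_X = np.hstack([
        cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
        for clf in components
    ])
    meta = LogisticRegression().fit(meta_X, y)           # simple linear model for L
    components = [clf.fit(X, y) for clf in components]   # refit components on all the data
    return components, meta

def stacking_predict(components, meta, X):
    meta_X = np.hstack([clf.predict_proba(X) for clf in components])
    return meta.predict(meta_X)
```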
Error-correcting Codes
Using binary classifiers to predict a multi-class problem.
Generate one binary classifier C_i for each class vs. every other class (LHS), or one per code-word bit (RHS):

  class  C_1 C_2 C_3 C_4        class  C_1 C_2 C_3 C_4 C_5 C_6 C_7
    a     1   0   0   0           a     1   1   1   1   1   1   1
    b     0   1   0   0           b     0   0   0   0   1   1   1
    c     0   0   1   0           c     0   0   1   1   0   0   1
    d     0   0   0   1           d     0   1   0   1   0   1   0

Each binary classifier C_i predicts the i-th bit.
LHS: predictions like "1 0 1 0" cannot be "decoded".
RHS: predictions like "1 0 1 1 1 1 1" are class "a" (C_2 made a mistake).
Hamming Distance
The Hamming distance H between code-words = number of single-bit corrections needed to convert one into the other:
- H(1000, 0100) = 2
- H(1111111, 0000111) = 4
(d-1)/2 single-bit errors (rounded down) can be corrected if d = minimum Hamming distance between any pair of code-words.
LHS: d = 2 - no error correction.
RHS: d = 4 - corrects all single-bit errors.
Tom Dietterich and Ghulum Bakiri, 1995.
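A minimal sketch of nearest-code-word decoding for the RHS code from the previous slide (an illustration, not the slides' code):

```python
# Predict each bit with a binary classifier, then return the class whose
# code-word is closest in Hamming distance to the predicted bit string.
CODEBOOK = {
    "a": "1111111",
    "b": "0000111",
    "c": "0011001",
    "d": "0101010",
}

def hamming(u: str, v: str) -> int:
    """Number of positions in which the two bit strings differ."""
    return sum(a != b for a, b in zip(u, v))

def decode(predicted_bits: str) -> str:
    """Nearest code-word decoding, e.g. decode('1011111') == 'a'."""
    return min(CODEBOOK, key=lambda cls: hamming(CODEBOOK[cls], predicted_bits))
```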