Lecture 21
Ensemble methods: Bagging and Boosting
Milos Hauskrecht
milos@cs.pitt.edu
5329 Sennott Square

Ensemble methods
- Mixture of experts:
  - Multiple base models (classifiers, regressors), each covers a different part (region) of the input space
- Committee machines:
  - Multiple base models (classifiers, regressors), each covers the complete input space
  - Each base model is trained on a slightly different training set
  - Combine the predictions of all models to produce the output
- Goal: Improve the accuracy of the base model
- Methods: Bagging, Boosting, Stacking (not covered)
Bagging (Bootstrap Aggregating)
- Given:
  - A training set of N examples
  - A class of learning models (e.g. decision trees, neural networks, ...)
- Method: Train multiple (k) models on different samples (data splits) and average their predictions
  - Predict (test) by averaging the results of the k models
- Goal: Improve the accuracy of one model by using its multiple copies
  - Averaging the misclassification errors on different data splits gives a better estimate of the predictive ability of a learning method

Bagging algorithm
- Training: in each iteration t, t = 1, ..., T
  - Randomly sample with replacement N samples from the training set
  - Train a chosen base model (e.g. neural network, decision tree) on the samples
- Test: for each test example
  - Run all T trained base models
  - Predict by combining the results of all T trained models:
    - Regression: averaging
    - Classification: a majority vote
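A minimal sketch of this procedure, assuming scikit-learn decision trees as the base model and non-negative integer class labels (both are illustrative choices, not part of the original slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=25, rng=np.random.default_rng(0)):
    """Train T base models, each on a bootstrap sample (N draws with replacement)."""
    N = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)          # sample N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: majority vote over the T trained models."""
    votes = np.stack([m.predict(X) for m in models])     # shape (T, n_examples)
    # per example, pick the most frequent class (labels assumed to be 0, 1, 2, ...)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

For regression, the majority vote in bagging_predict would simply be replaced by the mean of the T predictions.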
[Figure: test examples classified by base models H1, H2, H3; the final class (yes/no) is decided by simple majority voting.]

Analysis of Bagging
- Expected error = Bias + Variance
- Expected error is the expected discrepancy between the estimated and the true function:
  $E\big[(\hat{f}(X) - f(X))^2\big]$
- Bias is the squared discrepancy between the averaged estimate and the true function:
  $\big(E[\hat{f}(X)] - f(X)\big)^2$
- Variance is the expected divergence of the estimated function from its average value:
  $E\big[(\hat{f}(X) - E[\hat{f}(X)])^2\big]$
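The decomposition stated above can be checked in one line of algebra (a standard derivation, added here for completeness): writing $\hat{f}(X) - f(X) = (\hat{f}(X) - E[\hat{f}(X)]) + (E[\hat{f}(X)] - f(X))$ and expanding the square,

$$E\big[(\hat{f}(X) - f(X))^2\big] = \underbrace{E\big[(\hat{f}(X) - E[\hat{f}(X)])^2\big]}_{\text{variance}} + \underbrace{\big(E[\hat{f}(X)] - f(X)\big)^2}_{\text{bias}^2}$$

since the cross term vanishes: $E\big[\hat{f}(X) - E[\hat{f}(X)]\big] = 0$.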
When does Bagging work? Under-fitting and over-fitting
- Under-fitting:
  - High bias (models are not accurate)
  - Small variance (small influence of the examples in the training set)
- Over-fitting:
  - Small bias (models flexible enough to fit the training data well)
  - Large variance (models depend very much on the training set)

Averaging decreases variance
- Example: assume we measure a random variable x with a $N(\mu, \sigma^2)$ distribution
- If only one measurement x_1 is done:
  - The expected mean of the measurement is $\mu$
  - The variance is $\mathrm{Var}(x_1) = \sigma^2$
- If x is measured K times (x_1, x_2, ..., x_K) and the value is estimated as (x_1 + x_2 + ... + x_K)/K:
  - The mean of the estimate is still $\mu$
  - But the variance is smaller:
    $[\mathrm{Var}(x_1) + \dots + \mathrm{Var}(x_K)]/K^2 = K\sigma^2/K^2 = \sigma^2/K$
- Observe: Bagging is a kind of averaging!
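A quick numerical check of the $\sigma^2/K$ claim (the mean, variance, K, and number of trials below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, K, trials = 5.0, 2.0, 10, 100_000

# average K measurements of x ~ N(mu, sigma^2), repeated over many trials
means = rng.normal(mu, sigma, size=(trials, K)).mean(axis=1)

print(means.mean())   # close to mu = 5.0
print(means.var())    # close to sigma**2 / K = 0.4
```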
When Bagging works
- Main property of Bagging (proof omitted):
  - Bagging decreases the variance of the base model without changing the bias!!!
  - Why? Averaging!
- Bagging typically helps:
  - When applied to an over-fitted base model with a high dependency on the actual training data
- It does not help much:
  - When the bias is high and the base model is robust to changes in the training data (due to sampling)

Boosting
- Mixture of experts:
  - One expert per region
  - Expert switching
- Bagging:
  - Multiple models over the complete space; a learner is not biased to any region
  - Learners are learned independently
- Boosting:
  - Every learner covers the complete space
  - Learners are biased toward regions not predicted well by the other learners
  - Learners are dependent
Boosting. Theoretical foundations.
- PAC: Probably Approximately Correct framework; an (ε, δ) solution
- PAC learning: learning with pre-specified error (ε) and confidence (δ) parameters:
  the probability that the misclassification error is larger than ε is smaller than δ:
  $P(ME(c) > \varepsilon) < \delta$
- Accuracy (1 - ε): percent of correctly classified samples in the test
- Confidence (1 - δ): the probability that in one experiment this accuracy is achieved:
  $P(Acc(c) \geq 1 - \varepsilon) \geq (1 - \delta)$

PAC Learnability
- Strong (PAC) learnability: there exists a learning algorithm that efficiently learns the classification with a pre-specified accuracy and confidence
- Strong (PAC) learner: a learning algorithm P that, given an arbitrary classification error ε (< 1/2) and confidence δ (< 1/2):
  - Outputs a classifier that satisfies these parameters, in other words achieves:
    - classification accuracy > (1 - ε)
    - with confidence (probability) > (1 - δ)
  - And runs in time polynomial in 1/δ, 1/ε
    - Implies: the number of samples N is polynomial in 1/δ, 1/ε
Weak Learner
- Weak learner: a learning algorithm (learner) W that gives:
  - a classification accuracy > 1 - ε₀
  - with probability > 1 - δ₀
  - for some fixed and uncontrollable error ε₀ (< 1/2) and confidence δ₀ (< 1/2)
  - and this on an arbitrary distribution of data entries

Weak learnability = Strong (PAC) learnability
- Assume there exists a weak learner:
  - it is better than a random guess (> 50%), with confidence higher than 50%, on any data distribution
- Question: Is the problem then also PAC-learnable?
  - Can we generate an algorithm P that achieves an arbitrary (ε, δ) accuracy?
- Why is this important?
  - Usual classification methods (decision trees, neural nets) have specified, but uncontrollable, performance
  - Can we improve the performance to achieve any pre-specified accuracy (confidence)?
Weak = Strong learnability!!!
- Proof due to R. Schapire
- An arbitrary (ε, δ) improvement is possible
- Idea: combine multiple weak learners together
- Weak learner W with confidence δ₀ and maximal error ε₀
- It is possible:
  - To improve (boost) the confidence
  - To improve (boost) the accuracy by training different weak learners on slightly different datasets

Boosting accuracy
[Figure: samples drawn from the training distribution are fed to learners H1, H2, H3; regions of correct classification, wrong classification, and the region where H1 and H2 classify differently are highlighted.]
Boosting accuracy
- Training:
  - Sample randomly from the distribution of examples; train hypothesis H1 on the sample
  - Evaluate the accuracy of H1 on the distribution
  - Sample randomly such that for half of the samples H1 gives correct results and for the other half incorrect results; train hypothesis H2
  - Train H3 on samples from the distribution on which H1 and H2 classify differently
- Test:
  - For each example, decide according to the majority vote of H1, H2 and H3
- Theorem: if each hypothesis has an error < ε₀, the final voting classifier has error < g(ε₀) = 3ε₀² - 2ε₀³
  - Accuracy improved!!!!
  - Apply recursively to get to the target accuracy!!! (see the worked numbers below)
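To make the improvement concrete, a worked example (my own numbers, not from the slides): starting from ε₀ = 0.3 and iterating the construction,

$$g(0.3) = 3(0.3)^2 - 2(0.3)^3 = 0.27 - 0.054 = 0.216, \qquad g(0.216) \approx 0.120, \qquad g(0.120) \approx 0.040$$

so two or three levels of the recursive construction already drive a 30% error rate down to a few percent.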
Theoretical Boosting algorithm
- Similarly to boosting the accuracy, we can boost the confidence at some restricted accuracy cost
- The key result: we can improve both the accuracy and the confidence
- Problems with the theoretical algorithm:
  - It needs a classifier that is good (better than 50%) on all distributions and problems
  - We cannot get a good sample from the data distribution
  - The method requires a large training set
- Solution to the sampling problem:
  - Boosting by sampling
  - AdaBoost algorithm and variants

AdaBoost
- AdaBoost: boosting by sampling
- Classification (Freund, Schapire; 1996):
  - AdaBoost.M1 (two-class problem)
  - AdaBoost.M2 (multiple-class problem)
- Regression (Drucker; 1997):
  - AdaBoostR
AdaBoost
- Given:
  - A training set of N examples (attribute + class label pairs)
  - A base learning model (e.g. a decision tree, a neural network)
- Training stage:
  - Train a sequence of T base models on T different sampling distributions defined over the training set (D)
  - The sample distribution D_t for building model t is constructed by modifying the sampling distribution D_{t-1} from the (t-1)th step: examples classified incorrectly in the previous step receive higher weights in the new data (an attempt to cover the misclassified samples)
- Application (classification) stage:
  - Classify according to the weighted majority of classifiers

AdaBoost training.
[Figure: training data; distribution D_1 -> learn Model 1 -> test -> Errors 1; D_2 -> Model 2 -> Errors 2; ... ; D_T -> Model T -> Errors T.]
AdaBoost algorithm
- Training (step t):
  - Sampling distribution D_t:
    - D_t(i) - the probability that example i from the original training dataset is selected
    - D_1(i) = 1/N for the first step (t = 1)
  - Take K samples from the training set according to D_t
  - Train a classifier h_t on the samples
  - Calculate the error ε_t of h_t:
    $\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$
  - Classifier weight parameter: $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$
  - New sampling distribution:
    $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t & h_t(x_i) = y_i \\ 1 & \text{otherwise} \end{cases}$
    where Z_t is a normalization constant

AdaBoost. Sampling Probabilities
- Example: a nonlinearly separable binary classification problem, with neural networks as weak learners
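A compact sketch of the training step above (AdaBoost.M1-style), assuming scikit-learn decision stumps as the weak learner, binary labels in {0, 1}, and K = N samples per round; these are illustrative simplifications, not part of the original slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=20, rng=np.random.default_rng(0)):
    """AdaBoost.M1-style training by resampling; labels y are assumed to be 0/1."""
    N = len(X)
    D = np.full(N, 1.0 / N)                            # D_1(i) = 1/N
    models, betas = [], []
    for t in range(T):
        idx = rng.choice(N, size=N, p=D)               # take samples according to D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        miss = h.predict(X) != y                       # which original examples h_t gets wrong
        eps = D[miss].sum()                            # epsilon_t = sum of D_t(i) over misclassified i
        if eps == 0 or eps >= 0.5:                     # weak-learner assumption violated; stop early
            break
        beta = eps / (1.0 - eps)                       # beta_t = epsilon_t / (1 - epsilon_t)
        D = D * np.where(miss, 1.0, beta)              # multiply correctly classified examples by beta_t
        D = D / D.sum()                                # Z_t: renormalize to a distribution
        models.append(h)
        betas.append(beta)
    return models, betas
```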
AdaBoost: Sampling Probabilities
[Figure: sampling probabilities over the input space in later boosting iterations.]

AdaBoost classification
- We have T different classifiers h_t
  - the weight w_t of classifier h_t is proportional to its accuracy on the training set:
    $w_t = \log(1/\beta_t) = \log\big((1-\varepsilon_t)/\varepsilon_t\big)$, where $\beta_t = \varepsilon_t/(1-\varepsilon_t)$
- Classification:
  - For every class j = 0, 1:
    - Compute the sum of the weights w_t of ALL classifiers h_t that predict class j
  - Output the class that corresponds to the maximal sum of weights (weighted majority):
    $h_{final}(x) = \arg\max_j \sum_{t:\, h_t(x) = j} w_t$
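A matching sketch of the weighted-majority decision for the two-class case, reusing the models and betas returned by the hypothetical adaboost_train above:

```python
import numpy as np

def adaboost_predict(models, betas, X, n_classes=2):
    w = np.log(1.0 / np.array(betas))          # w_t = log(1 / beta_t)
    scores = np.zeros((len(X), n_classes))
    for h, w_t in zip(models, w):
        pred = h.predict(X)                    # class predicted by h_t
        for j in range(n_classes):
            scores[pred == j, j] += w_t        # add w_t to the class that h_t votes for
    return scores.argmax(axis=1)               # h_final(x) = argmax_j of the summed weights
```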
Two-Class example. Classification.
- Classifier 1: yes, weight 0.7
- Classifier 2: no, weight 0.3
- Classifier 3: no, weight 0.2
- Weighted majority: yes = 0.7, no = 0.3 + 0.2 = 0.5; difference 0.7 - 0.5 = +0.2, so the final choice is "yes"

What is boosting doing?
- Each classifier specializes on a particular subset of examples
- The algorithm concentrates on more and more difficult examples
- Boosting can:
  - Reduce variance (the same as Bagging)
  - But also eliminate the effect of the high bias of the weak learner (unlike Bagging)
- Train versus test error performance:
  - Training errors can be driven close to 0
  - But test errors do not show overfitting
- Proofs and theoretical explanations in a number of papers
Boosting. Error performances
[Figure: training error, test error, and single-learner error as a function of the number of boosting iterations; the training error is driven toward 0 while the test error keeps decreasing.]

Model Averaging
- An alternative way to combine multiple models; it can be used in both supervised and unsupervised frameworks
- For example, the likelihood of the data can be expressed by averaging over multiple models:
  $P(D) = \sum_{i=1}^{N} P(D \mid M_i) P(M_i)$
- Prediction:
  $P(y \mid x) = \sum_{i=1}^{N} P(y \mid x, M_i) P(M_i)$
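A small illustrative sketch of the prediction rule above; the model objects and their prior weights are hypothetical placeholders, and predict_proba follows the scikit-learn convention rather than anything prescribed by the slides:

```python
import numpy as np

def model_average_predict(models, priors, X):
    """P(y|x) = sum_i P(y | x, M_i) * P(M_i), with the priors P(M_i) summing to 1."""
    priors = np.asarray(priors, dtype=float)
    priors = priors / priors.sum()                           # ensure a proper distribution over models
    probs = np.stack([m.predict_proba(X) for m in models])   # shape (n_models, n_examples, n_classes)
    return np.tensordot(priors, probs, axes=1)               # average the class probabilities over models
```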