Bootstrap Aggregating (Bagging)
An ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms
Can be used in both regression and classification
Reduces variance and helps to avoid overfitting
Usually applied to decision trees, though it can be used with any type of method
An Aside: Ensemble Methods
In a nutshell: a combination of multiple learning algorithms with the goal of achieving better predictive performance than could be obtained from any of these classifiers alone
A meta-algorithm that can be considered to be, in itself, a supervised learning algorithm since it produces a single hypothesis
Tend to work better when there is diversity among the models
Examples: Bagging, Boosting, Bucket of models, Stacking
An Aside: Ensemble Methods
[Figure: a traditional learner trains a single hypothesis h_1 on one training set S and predicts y = h_1(x); an ensemble method trains learners L_1, ..., L_6 on different training sets and/or with different learning algorithms, producing hypotheses h_1, ..., h_6 that are combined into h = f(h_1, ..., h_6)]
Back to Bagging
The idea:
1. Create N bootstrap samples {S_1, ..., S_N} of S as follows: for each S_i, randomly draw |S| examples from S with replacement
2. For each i = 1, ..., N: h_i = Learn(S_i)
3. Output H = <h_1, ..., h_N, majorityVote>
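A minimal Python sketch of this procedure, assuming scikit-learn's DecisionTreeClassifier as the base learner (the Learn step) and scipy for the majority vote; the function names are illustrative, not from the slides:

```python
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, seed=0):
    """Steps 1-2: build N bootstrap samples of S and learn one h_i on each."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)      # draw |S| examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Step 3: combine h_1, ..., h_N by majority vote."""
    votes = np.array([m.predict(X) for m in models])       # shape (N, n_samples)
    return stats.mode(votes, axis=0, keepdims=False).mode  # per-sample majority
```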
Most notable benefits:
1. Surprisingly competitive performance, and rarely overfits
2. Is capable of reducing the variance of constituent models
3. Improves the ability to ignore irrelevant features
Remember: error(x) = noise(x) + bias(x) + variance(x)
Variance: how much does the prediction change if we change the training set?
Bagging Example 1 [figure]
Bagging Example 2 [figure]
Bagging Example 3 (1) [figure]
Bagging Example 3 (2) [figure; the ensemble reaches 100% accuracy]
How does bagging minimize error?
The ensemble reduces the overall variance
Let f(x) be the target value of x, h_1 to h_n be the set of base hypotheses, and h_avg be the average of the base hypotheses' predictions
Squared error: Error(h_i, x) = (f(x) - h_i(x))^2
Is there any relation between h_avg and variance? Yes
How does bagging minimize error?
Error(h_i, x) = (f(x) - h_i(x))^2
Error(h_avg, x) = (1/n) Σ_{i=1}^{n} Error(h_i, x) - (1/n) Σ_{i=1}^{n} (h_i(x) - h_avg(x))^2
By the above, we see that the squared error of the average hypothesis equals the average squared error of the base hypotheses minus the variance of the base hypotheses
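As a sanity check, the identity can be verified numerically at a single point x; the target value and base predictions below are made up for illustration:

```python
import numpy as np

f_x = 3.0                                   # target value f(x)
h_x = np.array([2.5, 3.4, 2.9, 3.8, 2.2])   # base predictions h_1(x)..h_5(x)
h_avg = h_x.mean()                          # the averaged hypothesis h_avg(x)

lhs = (f_x - h_avg) ** 2                                       # Error(h_avg, x)
rhs = np.mean((f_x - h_x) ** 2) - np.mean((h_x - h_avg) ** 2)  # avg error - variance
print(lhs, rhs)   # both print ~0.0016: averaging never hurts at a fixed x
```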
Stability of Learners
A learning algorithm is unstable if small changes in the training data can produce large changes in the output hypothesis (otherwise it is stable)
Clearly, bagging will have little benefit when used with stable base learning algorithms (i.e., most ensemble members will be very similar)
Bagging generally works best when used with unstable yet relatively accurate base learners
Bagging Summary
Works well if the base classifiers are unstable (complement each other)
Increased accuracy because it reduces the variance of the individual classifiers
Does not focus on any particular instance of the training data
Therefore, less susceptible to model overfitting when applied to noisy data
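For completeness, bagging as summarized here is available off the shelf. The snippet below (dataset, label-noise level, and parameters are arbitrary choices, not from the slides) compares a single unstable tree against a bagged ensemble on noisy data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise to mimic the "noisy data" setting.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bag = BaggingClassifier(tree, n_estimators=50, random_state=0)

print(cross_val_score(tree, X, y).mean())   # single high-variance tree
print(cross_val_score(bag, X, y).mean())    # bagged trees: typically higher
```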
Boosting
Key difference with respect to bagging: boosting is iterative
Bagging: each individual classifier is independent
Boosting: looks at the errors from previous classifiers to decide what to focus on in the next iteration; successive classifiers depend on their predecessors
Key idea: place more weight on hard examples (i.e., instances that were misclassified in previous iterations)
Historical Notes
The idea of boosting began with a learning theory question first asked in the late '80s
The question was answered in 1989 by Robert Schapire, resulting in the first theoretical boosting algorithm
Schapire and Freund later developed a practical boosting algorithm called AdaBoost
Many empirical studies show that AdaBoost is highly effective (the ensembles it produces very often outperform those produced by bagging)
Boosting
An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
Initially, all N records are assigned equal weights
Unlike in bagging, weights may change at the end of each boosting round
Different implementations vary in terms of (1) how the weights of the training examples are updated and (2) how the predictions are combined
Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased

Sampled record ids per round:
Original Data:      1   2   3   4   5   6   7   8   9   10
Boosting (Round 1): 7   3   2   8   7   9   4   10  6   3
Boosting (Round 2): 5   4   9   4   2   5   1   7   4   2
Boosting (Round 3): 4   4   8   10  4   5   4   6   3   4

Example 4 is hard to classify: its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
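A small sketch of how such a resample can be drawn: records are selected with replacement, with probability proportional to their current weights. The weights below are invented for illustration; the table's exact weights are not given in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
records = np.arange(1, 11)            # record ids 1..10
weights = np.full(10, 0.1)            # round 1: equal weights
weights[3] *= 4.0                     # suppose record 4 was misclassified
weights /= weights.sum()              # renormalize to a distribution

sample = rng.choice(records, size=10, replace=True, p=weights)
print(sample)   # record 4 now tends to appear more often than the others
```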
Boosting
Equal weights are assigned to each training instance (1/N for round 1) at the first round
After a classifier C_i is learned, the weights are adjusted to allow the subsequent classifier C_{i+1} to pay more attention to the data that were misclassified by C_i
The final boosted classifier C* combines the votes of each individual classifier
The weight of each classifier's vote is a function of its accuracy
AdaBoost: a popular boosting algorithm
AdaBoost (Adaptive Boosting)
Input:
Training set D containing N instances
T rounds
A classification learning scheme
Output:
A composite model
AdaBoost: Training Phase
The training data D contains N labeled instances: (X_1, y_1), (X_2, y_2), (X_3, y_3), ..., (X_N, y_N)
Initially, each instance is assigned an equal weight of 1/N
To generate T base classifiers, we need T rounds (iterations)
Round i: data are sampled from D with replacement to form D_i (of size N)
Each instance's chance of being selected in the next rounds depends on its weight
Each time, the new sample is generated directly from the training data D with different sampling probabilities according to the current weights; these weights are never zero
AdaBoost: Training Phase
The base classifier C_i is derived from the training data D_i
The error of C_i is tested using D_i
The weights of the training data are adjusted depending on how they were classified:
Correctly classified: decrease weight
Incorrectly classified: increase weight
The weight of an instance indicates how hard it is to classify (directly proportional)
AdaBoost: Testing Phase
The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be
Weight of classifier C_i's vote: α_i = (1/2) ln((1 - ε_i) / ε_i)
Testing: for each class c, sum the weights of each classifier that assigned class c to x_test (unseen data); the class with the highest sum is the WINNER!
C*(x_test) = argmax_y Σ_{i=1}^{T} α_i δ(C_i(x_test) = y), where δ(·) is 1 if its argument is true and 0 otherwise
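The weighted vote can be written directly from the formula. The sketch below assumes scikit-learn-style classifiers (predict on a 2-D array) and the alphas produced during training:

```python
import numpy as np

def adaboost_predict(classifiers, alphas, x_test):
    """C*(x) = argmax_y sum_i alpha_i * delta(C_i(x) = y)."""
    scores = {}
    for clf, alpha in zip(classifiers, alphas):
        label = clf.predict(np.atleast_2d(x_test))[0]   # C_i's vote on x_test
        scores[label] = scores.get(label, 0.0) + alpha  # add alpha_i to that class
    return max(scores, key=scores.get)                  # highest sum wins
```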
Example: Error and Classifier Weight in AdaBoost
Base classifiers: C_1, C_2, ..., C_T
Error rate (i = index of classifier, j = index of instance):
ε_i = (1/N) Σ_{j=1}^{N} w_j δ(C_i(x_j) ≠ y_j)
Importance of a classifier:
α_i = (1/2) ln((1 - ε_i) / ε_i)
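The two formulas translate directly to code. One hedge: the snippet assumes the instance weights are kept normalized (summing to 1), so the averaging is already folded into the weights and the weighted error is just a weighted fraction of misclassified instances:

```python
import numpy as np

def error_rate(w, preds, y):
    """Weighted error of one classifier: sum_j w_j * delta(C_i(x_j) != y_j),
    with the weights w normalized to sum to 1."""
    return float(np.sum(w * (preds != y)))

def classifier_weight(eps):
    """alpha_i = (1/2) * ln((1 - eps_i) / eps_i); larger for more accurate C_i."""
    return 0.5 * np.log((1.0 - eps) / eps)
```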
Example: Data Instance Weight in AdaBoost
Assume: N training instances in D, T rounds; (x_j, y_j) are the training instances; C_i and α_i are the classifier and weight of the i-th round, respectively
Weight update on all training instances in D:
w_j^{(i+1)} = (w_j^{(i)} / Z_i) × exp(-α_i)  if C_i(x_j) = y_j
w_j^{(i+1)} = (w_j^{(i)} / Z_i) × exp(+α_i)  if C_i(x_j) ≠ y_j
where Z_i is the normalization factor
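Putting the update rule together with the resampling and classifier weights gives the compact training loop below. Decision stumps from scikit-learn stand in for the unspecified "classification learning scheme", and the ε ≥ 0.5 reset is one common convention rather than something stated in the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    w = np.full(N, 1.0 / N)               # round 1: equal weights 1/N
    classifiers, alphas = [], []
    for _ in range(T):
        # Sample D_i from D with replacement, according to the current weights.
        idx = rng.choice(N, size=N, replace=True, p=w)
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        preds = clf.predict(X)
        eps = np.sum(w * (preds != y))    # weighted error (w sums to 1)
        if eps >= 0.5:                    # worse than chance: reset and skip round
            w = np.full(N, 1.0 / N)
            continue
        eps = max(eps, 1e-10)             # avoid divide-by-zero when eps == 0
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        # w_j <- (w_j / Z_i) * exp(-alpha) if classified correctly, exp(+alpha) if not
        w = w * np.exp(np.where(preds == y, -alpha, alpha))
        w /= w.sum()                      # Z_i: renormalize so the weights sum to 1
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas
```

The returned classifiers and alphas plug straight into the adaboost_predict sketch from the testing-phase slide.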
Illustrating AdaBoost
[Figure: ten one-dimensional data points (+/- labels) tracked across three boosting rounds; each round resamples the data according to the current weights and fits a base classifier B1, B2, or B3]
Initial weight for each data point: 0.1
Round 1 (B1): updated weights 0.0094, 0.0094, 0.4623; α_1 = 1.9459
Round 2 (B2): updated weights 0.3037, 0.0009, 0.0422; α_2 = 2.9323
Round 3 (B3): updated weights 0.0276, 0.1819, 0.0038; α_3 = 3.8744
Overall: the weighted combination classifies every point correctly: + + + - - - - - + +
Bagging and Boosting Summary
Bagging:
o Resamples data points
o Weight of each classifier is the same
o Only reduces variance
o Robust to noise and outliers
Boosting:
o Reweights data points (modifies the data distribution)
o Weight of each classifier varies depending on its accuracy
o Reduces both bias and variance
o Can hurt performance with noise and outliers