2D1431 Machine Learning Bagging & Boosting
Outline: Bagging and Boosting, Evaluating Hypotheses, Feature Subset Selection, Model Selection
Question of the Day Three salesmen arrive at a hotel one night and ask for a room. The manager informs them the price is $30.00. Each gives the manager a ten, and they retire for the night. Shortly, the manager remembers that the room was at a discount, on account of it being haunted. So he tells his bellhop that the room was only $25.00, gives the bellhop five dollars and tells him to give the men the refund. The bellhop is slightly crooked and rationalizes, "Five doesn't divide well among three, I'll save them some arguing and just give them a dollar each." Which he does, and keeps the leftover two dollars for himself. Now each of the men paid $9.00 for the room, for a total of $27.00. The bellhop has $2.00. $27+$2=$29. What happened to the missing dollar?
Classification Problem Assume a set S of N instances x_i ∈ X, each belonging to one of M classes {c_1, ..., c_M}. The training set consists of pairs (x_i, c_i). A classifier C assigns a classification C(x) ∈ {c_1, ..., c_M} to an instance x. The classifier learned in trial t is denoted C_t, while C* is the composite boosted classifier.
Linear Discriminant Analysis Possible classes: {x, o}. LDA decision rule: classify an instance as x if w_1 x_1 + w_2 x_2 + w_3 > 0, otherwise as o. [Figure: instances of the two classes plotted in the (x_1, x_2) plane, separated by the decision boundary w_1 x_1 + w_2 x_2 + w_3 = 0.]
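As an illustration of the decision rule above, a minimal Python sketch; the weight values used here are arbitrary examples, not weights actually fitted by an LDA procedure:

```python
def lda_classify(x, w):
    """Linear discriminant rule: 'x' if w1*x1 + w2*x2 + w3 > 0, otherwise 'o'."""
    return 'x' if w[0] * x[0] + w[1] * x[1] + w[2] > 0 else 'o'

# Illustrative weights (not fitted by any LDA procedure)
w = [1.0, -1.0, 0.5]
print(lda_classify([2.0, 1.0], w))  # 'x', since 2.0 - 1.0 + 0.5 > 0
print(lda_classify([0.0, 2.0], w))  # 'o', since 0.0 - 2.0 + 0.5 < 0
```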
Bagging & Boosting Bagging and boosting aggregate multiple hypotheses generated by the same learning algorithm invoked over different distributions of training data [Breiman 1996, Freund & Schapire 1996]. Bagging and boosting produce a composite classifier with a smaller error on the training data by combining multiple hypotheses which individually have a larger error.
Bagging For each trial t = 1, 2, ..., T create a bootstrap sample S_t. Obtain a hypothesis C_t on the bootstrap sample S_t. The final classification is the majority class c_BA(x) = argmax_{c_j ∈ C} Σ_t δ(c_j, C_t(x)), where δ(a, b) = 1 if a = b and 0 otherwise. [Figure: training sets 1, 2, ..., T each train one classifier; for an instance x the individual predictions, e.g. C_1(x) = +, C_2(x) = -, ..., C_T(x) = +, are combined by majority vote into c_BA(x) = +.]
Bagging From the overall training set S randomly sample (with replacement) T different training sets S_1, ..., S_T of size N. For each sample set S_t obtain a hypothesis C_t. To an unseen instance x assign the majority classification c_BA(x) among the hypotheses' classifications c_t(x): c_BA(x) = argmax_{c_j ∈ C} Σ_t δ(c_j, c_t(x)). [Figure: each sample S_1, ..., S_T is used to train a hypothesis C_1, ..., C_T; their predictions c_1(x), ..., c_T(x) are combined by majority vote.]
Bagging Bagging requires unstable classifiers such as decision trees or concept learners. "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy." (Breiman 1996)
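A minimal sketch of the bagging procedure in Python, using scikit-learn decision trees as the unstable base learner; the function names and parameters (e.g. `bagging_fit`, `T=25`) are illustrative choices, not part of the lecture material:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T hypotheses C_1..C_T, each on a bootstrap sample of size N.

    X, y are numpy arrays; decision trees serve as the unstable base learner."""
    rng = np.random.default_rng(seed)
    N = len(X)
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)        # sample N indices with replacement
        hypotheses.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return hypotheses

def bagging_predict(hypotheses, x):
    """c_BA(x): majority vote over the individual classifications c_t(x)."""
    votes = [C_t.predict(np.asarray(x).reshape(1, -1))[0] for C_t in hypotheses]
    return Counter(votes).most_common(1)[0][0]
```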
Boosting Boosting maintains a weight w_i for each instance (x_i, c_i) in the training set. The higher the weight w_i, the more the instance x_i influences the next hypothesis learned. At each trial, the weights are adjusted to reflect the performance of the previously learned hypothesis: the weight of correctly classified instances is decreased and the weight of incorrectly classified instances is increased.
Boosting Construct a hypothesis C_t from the current distribution of instances described by the weights w^t. Adjust the weights according to the classification error ε_t of classifier C_t. The strength α_t of a hypothesis depends on its training error ε_t: α_t = ½ ln((1-ε_t)/ε_t). [Figure: loop diagram: from the set of weighted instances (x_i, w_i^t) learn a hypothesis C_t with strength α_t, then adjust the weights and repeat.]
Boosting The final hypothesis C_BO aggregates the individual hypotheses C_t by weighted voting: c_BO(x) = argmax_{c_j ∈ C} Σ_{t=1..T} α_t δ(c_j, C_t(x)). Each hypothesis' vote is a function of its accuracy. Let w_i^t denote the weight of an instance x_i at trial t; for every x_i, w_i^1 = 1/N. The weight w_i^t reflects the importance (e.g. probability of occurrence) of the instance x_i in the sample set S_t. At each trial t = 1, ..., T a hypothesis C_t is constructed from the given instances under the distribution w^t. This requires that the learning algorithm can deal with fractional (weighted) examples.
Boosting The error of the hypothesis C_t is measured with respect to the weights: ε_t = Σ_{i: C_t(x_i) ≠ c_i} w_i^t / Σ_i w_i^t, and its strength is α_t = ½ ln((1-ε_t)/ε_t). Update the weights of correctly and incorrectly classified instances by w_i^{t+1} = w_i^t e^{-α_t} if C_t(x_i) = c_i and w_i^{t+1} = w_i^t e^{α_t} if C_t(x_i) ≠ c_i. Afterwards normalize the w_i^{t+1} such that they form a proper distribution: Σ_i w_i^{t+1} = 1.
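A short numeric sketch of one weight update, assuming four instances with uniform weights of which one is misclassified (all values below are illustrative):

```python
import numpy as np

w = np.array([0.25, 0.25, 0.25, 0.25])         # current weights w_i^t (uniform)
correct = np.array([True, True, True, False])  # did C_t classify instance i correctly?

eps_t = w[~correct].sum() / w.sum()             # weighted error: 0.25
alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)     # strength: 0.5 * ln(3) ~ 0.549

w_next = np.where(correct, w * np.exp(-alpha_t), w * np.exp(alpha_t))
w_next /= w_next.sum()                          # normalize so the weights sum to 1
print(w_next)                                   # approx. [0.167 0.167 0.167 0.5]
```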
Boosting The classification c_BO(x) of the boosted hypothesis is obtained by summing the votes of the hypotheses C_1, C_2, ..., C_T, where the vote of each hypothesis C_t is weighted by α_t: c_BO(x) = argmax_{c_j ∈ C} Σ_{t=1..T} α_t δ(c_j, c_t(x)). [Figure: the hypotheses C_1, ..., C_T are trained on the weighted sets (S, w^1), ..., (S, w^T); their predictions c_1(x), ..., c_T(x) are combined by the weighted vote.]
Boosting Given {(x_1, c_1), ..., (x_m, c_m)}. Initialize w_i^1 = 1/m. For t = 1, ..., T: train the weak learner using distribution w^t; get a weak hypothesis C_t : X → C with error ε_t = Σ_{i: C_t(x_i) ≠ c_i} w_i^t / Σ_i w_i^t; choose α_t = ½ ln((1-ε_t)/ε_t); update w_i^{t+1} = w_i^t e^{-α_t} if C_t(x_i) = c_i and w_i^{t+1} = w_i^t e^{α_t} if C_t(x_i) ≠ c_i. Output the final hypothesis c_BO(x) = argmax_{c_j ∈ C} Σ_{t=1..T} α_t δ(c_j, C_t(x)).
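A minimal Python sketch of the boosting loop above, using depth-1 scikit-learn decision trees (stumps) fitted with sample weights as the weak learner; the function names `adaboost_fit` and `adaboost_predict` and the default `T=50` are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Run T boosting trials; return the hypotheses C_t and their strengths alpha_t."""
    m = len(X)
    w = np.full(m, 1.0 / m)                            # w_i^1 = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        # Weak learner trained under the current distribution via sample weights
        C_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = C_t.predict(X)
        eps_t = w[pred != y].sum() / w.sum()           # weighted training error
        if eps_t >= 0.5:                               # weak learner no better than chance
            break
        if eps_t == 0.0:                               # perfect hypothesis: keep it and stop
            hypotheses.append(C_t)
            alphas.append(1.0)
            break
        alpha_t = 0.5 * np.log((1.0 - eps_t) / eps_t)  # strength of C_t
        w = np.where(pred == y, w * np.exp(-alpha_t), w * np.exp(alpha_t))
        w /= w.sum()                                   # renormalize to a distribution
        hypotheses.append(C_t)
        alphas.append(alpha_t)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, x):
    """c_BO(x) = argmax_c sum_t alpha_t * delta(c, C_t(x))."""
    votes = {}
    for C_t, alpha_t in zip(hypotheses, alphas):
        c = C_t.predict(np.asarray(x).reshape(1, -1))[0]
        votes[c] = votes.get(c, 0.0) + alpha_t
    return max(votes, key=votes.get)
```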
Bayes MAP Hypothesis [Figure: Bayes MAP hypothesis for two classes x and o; red marks the incorrectly classified instances.]
Boosted Bayes MAP Hypothesis The boosted Bayes MAP hypothesis has a more complex decision surface than the individual hypotheses alone.
Feature Subset Selection [Figure: wrapper approach to feature subset selection: the input features feed a feature subset search; each candidate subset is evaluated by running the induction algorithm, and the feature subset evaluation guides the search.]
Feature Subset Selection Evaluate the accuracy of the hypothesis for a given subset of features by means of n-fold cross-validation. Forward selection: start with the empty set of features and greedily add the feature that improves the estimated accuracy the most. Backward elimination: start with the set of all features and greedily remove the worst feature. Forward-backward elimination: alternate between forward and backward steps, choosing whichever step improves the accuracy of the current subset of features.
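A minimal sketch of greedy forward selection with n-fold cross-validated accuracy as the evaluation function; the function name `forward_selection` and the choice of a decision tree as the induction algorithm are illustrative assumptions:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, n_folds=5):
    """Greedily add the feature that most improves n-fold cross-validated accuracy.

    X: numpy feature matrix (instances x features), y: class labels."""
    remaining = list(range(X.shape[1]))
    selected, best_acc = [], 0.0
    while remaining:
        # Estimate accuracy for every candidate extension of the current subset
        scores = {f: cross_val_score(DecisionTreeClassifier(),
                                     X[:, selected + [f]], y, cv=n_folds).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_acc:          # no feature improves accuracy: stop
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_acc = scores[f_best]
    return selected, best_acc
```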
Question of the Day What happened to the missing dollar? This classic is a simple wording deception that occurs in the last sentence: "Now each of the men paid $9.00 for the room, for a total of $27.00. The bellhop has $2.00. $27+$2=$29. What happened to the missing dollar?" There is no missing dollar; the arithmetic above makes no sense in the context of the problem. There is no reason to add the two dollars the bellhop kept to the nine dollars each of the men spent. Each man paid nine dollars, for a total of twenty-seven dollars. Of that twenty-seven, two went to the bellhop and twenty-five went to the hotel. The equation should be: $27 - $2 = $25, the price of the room.