Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi
Weak classifiers
Examples of weak classifiers:
- Decision stump: a one-level decision tree
- Naive Bayes: a classifier that ignores feature correlations
- Linear classifier, e.g., logistic regression
Weak classifiers usually have larger training error but smaller variance. A single weak classifier is usually not adequate in real applications, but it is possible to combine an ensemble of weak classifiers to build a strong one.
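To make the first example concrete, here is a minimal sketch (plain NumPy, with hypothetical helper names fit_stump/predict_stump) of a decision stump trained to minimize weighted 0/1 error; this is the kind of weak learner the boosting loop below will call repeatedly, not code from the original slides.

import numpy as np

def fit_stump(X, y, D):
    """Fit a decision stump (one-level decision tree) minimizing weighted 0/1 error.
    X: (m, d) features; y: (m,) labels in {-1, +1}; D: (m,) example weights summing to 1."""
    d = X.shape[1]
    best, best_err = None, np.inf
    for j in range(d):                         # try every feature
        for thr in np.unique(X[:, j]):         # and every observed threshold
            for sign in (+1, -1):              # both orientations of the split
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(D * (pred != y))  # weighted 0/1 error
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best, best_err

def predict_stump(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)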
Idea: weighted voting
Combine an ensemble of weak classifiers by weighted voting.
- Learn an ensemble of weak classifiers: although each weak classifier is not adequate for classifying the whole feature space, it can still classify certain parts of the feature space well.
- Give each weak classifier a weight based on its performance: a more accurate classifier votes with a larger weight.
Weighted voting usually gives better performance by combining complementary classifiers that are good at classifying different parts of the feature space.
Combined classifier (for two weak classifiers $h_1$ and $h_2$): $f(x) = \mathrm{sign}(\alpha_1 h_1(x) + \alpha_2 h_2(x))$
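Reusing the hypothetical stump helpers from the sketch above, a weighted vote over several weak classifiers could be written as follows (illustrative only):

import numpy as np

def weighted_vote(stumps, alphas, X):
    # sum of alpha_t * h_t(x), thresholded by sign, as in f(x) = sign(sum_t alpha_t h_t(x))
    score = sum(a * predict_stump(s, X) for s, a in zip(stumps, alphas))
    return np.sign(score)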
Problems to solve
- How is an ensemble of weak classifiers learned? In particular, how do we decide which part of the feature space each weak classifier focuses on?
- How is the weight of each weak classifier decided?
Boosting [Schapire 98]
Idea: learn a pool of weak classifiers (usually of the same type, e.g., decision stump or logistic regression) on different sets of training examples resampled from different parts of an original training set.
(Figure: the original training set is resampled into two sets; weak classifiers h_1 and h_2 are trained on them and then combined.)
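Resampling "from different parts" of the training set can be done by drawing examples with probability proportional to their weights; a minimal sketch (assuming NumPy, names hypothetical) is:

import numpy as np

def resample(X, y, D, rng=np.random.default_rng(0)):
    # draw m examples with replacement, example i chosen with probability D(i)
    idx = rng.choice(len(y), size=len(y), replace=True, p=D)
    return X[idx], y[idx]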
Boosting: the algorithm
On each iteration t:
- Weight each training example by how correctly it has been classified so far.
- Learn a weak classifier $h_t$ that best classifies the weighted training examples.
- Decide a strength $\alpha_t$ for this weak classifier.
Final classifier: $H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$
Learning from weighted training examples
Consider a weighted dataset where $D(i)$ is the weight of the i-th training example $(x_i, y_i)$.
Interpretations:
- The i-th example is counted as $D(i)$ examples.
- The i-th example is resampled from the training set with probability $D(i)$.
Two ways to learn a weak classifier from weighted training examples:
- Resample the training set according to $D(i)$ and train a weak classifier on the resampled set.
- Learn a weak classifier directly from the weighted examples, e.g., a weighted logistic regression classifier $h^* = \arg\min_h \sum_i D(i)\,\mathrm{loss}(h(x_i), y_i)$.
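For the second option, many libraries accept per-example weights directly; for instance, assuming scikit-learn is available (an assumption, not part of the slides), a weighted logistic regression weak learner could be sketched as:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_lr(X, y, D):
    # sample_weight plays the role of D(i): the fit minimizes
    # sum_i D(i) * log(1 + exp(-y_i f(x_i))) plus regularization
    clf = LogisticRegression()
    clf.fit(X, y, sample_weight=D)
    return clf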
AdaBoost [Freund & Schapire 95]
Given m training examples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$:
- Initialize $D_1(i) = 1/m$.
- For t = 1, ..., T:
  - Train a weak classifier $h_t$ on the examples weighted by $D_t$.
  - Compute the weighted error $\varepsilon_t = \sum_i D_t(i)\,\delta(h_t(x_i) \neq y_i)$.
  - Set $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.
  - Update $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is the normalizer that makes $D_{t+1}$ sum to one.
Output final classifier $H(x) = \mathrm{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.
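Putting these steps together, a compact AdaBoost loop might look as follows; this is a sketch reusing the hypothetical fit_stump/predict_stump helpers defined earlier, not the original course code.

import numpy as np

def adaboost(X, y, T=50):
    m = len(y)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    stumps, alphas, Zs = [], [], []
    for t in range(T):
        stump, eps = fit_stump(X, y, D)            # weak learner on weighted data
        eps = min(max(eps, 1e-12), 1 - 1e-12)      # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)    # alpha_t = 1/2 log((1 - eps_t)/eps_t)
        pred = predict_stump(stump, X)
        D = D * np.exp(-alpha * y * pred)          # reweight: up-weight mistakes
        Z = D.sum()                                # normalizer Z_t
        D = D / Z
        stumps.append(stump); alphas.append(alpha); Zs.append(Z)
    return stumps, alphas, Zs

def adaboost_predict(stumps, alphas, X):
    score = sum(a * predict_stump(s, X) for s, a in zip(stumps, alphas))
    return np.sign(score)                          # H(x) = sign(sum_t alpha_t h_t(x))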
Decide the combination weight for each weak classifier
Weight of weak classifier: $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$, where $\varepsilon_t$ is the weighted training error $\varepsilon_t = \sum_i D_t(i)\,\delta(h_t(x_i) \neq y_i)$.
If a classifier is better than a random guess, $\varepsilon_t < 0.5$ and $\alpha_t > 0$; otherwise $\alpha_t < 0$. In the latter case it is an adverse classifier rather than a weak classifier.
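For intuition, plugging numbers into this formula: a classifier with weighted error $\varepsilon_t = 0.1$ gets $\alpha_t = \frac{1}{2}\log 9 \approx 1.10$, one with $\varepsilon_t = 0.4$ gets $\alpha_t = \frac{1}{2}\log 1.5 \approx 0.20$, and a random guess with $\varepsilon_t = 0.5$ gets $\alpha_t = 0$, i.e., no vote at all.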
Boosting Example (Decision Stump): three weak classifiers (figure)
Boosting Example (Decision Stump): final classifier (figure)
How well can boosting reduce the training error?
If each weak classifier $h_t$ is slightly better than a random guess, with $\varepsilon_t < 0.5$, then the training error of AdaBoost decreases exponentially fast in the number T of weak classifiers combined: after T rounds it is at most $\exp\left(-2\sum_{t=1}^T \gamma_t^2\right)$ with $\gamma_t = \frac{1}{2} - \varepsilon_t$, as derived in the proof that follows.
This is an astounding result: AdaBoost can reduce the training error arbitrarily close to zero.
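As a quick numeric illustration of this bound: if every weak classifier beats random guessing by a margin of $\gamma_t \ge 0.1$ (i.e., $\varepsilon_t \le 0.4$), the bound gives a training error of at most $e^{-2} \approx 0.14$ after T = 100 rounds, and at most $e^{-10} \approx 4.5\times 10^{-5}$ after T = 500 rounds.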
Proof: Training error of AdaBoost
Note that the exponential function $\exp(-y_i f(x_i))$ is an upper bound of the 0/1 loss $\delta(y_i \neq H(x_i))$ as a function of $y_i f(x_i)$: whenever $H(x_i) = \mathrm{sign}(f(x_i))$ misclassifies $x_i$, we have $y_i f(x_i) \le 0$ and hence $\exp(-y_i f(x_i)) \ge 1$, while the exponential is always nonnegative.
(Figure: exponential loss and 0/1 loss plotted against $y_i f(x_i)$.)
Proof: Training error of AdaBoost
The total training error is bounded by
$\frac{1}{m}\sum_i \delta(y_i \neq H(x_i)) \le \frac{1}{m}\sum_i \exp(-y_i f(x_i))$,
where $f(x_i) = \sum_t \alpha_t h_t(x_i)$ and $H(x_i) = \mathrm{sign}(f(x_i))$ is the final classifier.
Proof: Training error of AdaBoost
$D_1(i) = \frac{1}{m}$
$D_2(i) = \frac{D_1(i)\exp(-y_i\alpha_1 h_1(x_i))}{Z_1} = \frac{\exp(-y_i\alpha_1 h_1(x_i))}{m Z_1}$
$D_3(i) = \frac{D_2(i)\exp(-y_i\alpha_2 h_2(x_i))}{Z_2} = \frac{\exp(-y_i\alpha_1 h_1(x_i))\exp(-y_i\alpha_2 h_2(x_i))}{m Z_1 Z_2}$
By induction, we have $D_{T+1}(i) = \frac{D_T(i)\exp(-y_i\alpha_T h_T(x_i))}{Z_T} = \frac{\exp\left(-\sum_t y_i\alpha_t h_t(x_i)\right)}{m Z_1 Z_2\cdots Z_T} = \frac{\exp(-y_i f(x_i))}{m Z_1 Z_2\cdots Z_T}$
From $\sum_i D_{T+1}(i) = 1$, we have $\prod_t Z_t = \frac{1}{m}\sum_i \exp(-y_i f(x_i))$.
Proof: Training error of AdaBoost
So we have $\frac{1}{m}\sum_i \delta(y_i \neq H(x_i)) \le \frac{1}{m}\sum_i \exp(-y_i f(x_i)) = \prod_t Z_t$.
If $Z_t < 1$ for every t, the training error decays exponentially.
Proof: Training error of AdaBoost
Find the optimal weight $\alpha_t$ by minimizing $Z_t$:
$Z_t = \sum_i D_t(i)\exp(-\alpha_t y_i h_t(x_i)) = \sum_{h_t(x_i)\neq y_i} D_t(i)\exp(\alpha_t) + \sum_{h_t(x_i)= y_i} D_t(i)\exp(-\alpha_t) = \exp(\alpha_t)\,\varepsilon_t + \exp(-\alpha_t)\,(1-\varepsilon_t)$
Setting $\frac{\partial Z_t}{\partial \alpha_t} = \exp(\alpha_t)\,\varepsilon_t - \exp(-\alpha_t)\,(1-\varepsilon_t) = 0$ gives the optimal $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$, and then $Z_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)} = \sqrt{1-(1-2\varepsilon_t)^2}$.
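As a quick check of these formulas with concrete numbers: take $\varepsilon_t = 0.2$. Then $\alpha_t = \frac{1}{2}\log 4 \approx 0.693$, so $\exp(\alpha_t) = 2$ and $\exp(-\alpha_t) = 0.5$, giving $Z_t = 2\cdot 0.2 + 0.5\cdot 0.8 = 0.8$, which matches $2\sqrt{0.2\cdot 0.8} = 0.8$.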
Proof: Training error of AdaBoost
Putting the pieces together, the training error is bounded by
$\frac{1}{m}\sum_i \delta(y_i \neq H(x_i)) \le \prod_t Z_t = \prod_t 2\sqrt{\varepsilon_t(1-\varepsilon_t)} = \prod_t \sqrt{1-4\gamma_t^2} \le \exp\left(-2\sum_t \gamma_t^2\right)$,
where $\gamma_t = \frac{1}{2} - \varepsilon_t$ and the last step uses $1 - x \le e^{-x}$. If every $\gamma_t \ge \gamma > 0$, the training error is at most $\exp(-2T\gamma^2)$, i.e., it decays exponentially fast in T.
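To see this bound empirically, one could run the adaboost sketch from earlier on a toy dataset and compare the training error against $\prod_t Z_t$; the snippet below is illustrative only and reuses the hypothetical helpers defined above.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # toy labels that stumps can boost well

stumps, alphas, Zs = adaboost(X, y, T=30)
train_err = np.mean(adaboost_predict(stumps, alphas, X) != y)
print(train_err, np.prod(Zs))                     # the training error never exceeds prod(Z_t)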
Digit recognition example: the test error continues to decrease even after the training error reaches zero. Does this show that boosting is robust to overfitting?
Comparison between LR and Boosting
Logistic Regression assumes $P(y_i = 1 \mid x_i) = \frac{1}{1+\exp(-f(x_i))}$, where $f(x) = \sum_d w_d x_d + w_0$.
Maximizing the data log-likelihood, $\max_w -\sum_i \log(1+\exp(-y_i f(x_i)))$, is equivalent to $\min_w \sum_i \log(1+\exp(-y_i f(x_i)))$.
Comparison between LR and Boosting
Logistic Regression: $\min_w \sum_i \log(1+\exp(-y_i f(x_i)))$, with $f(x) = \sum_d w_d x_d + w_0$.
Boosting: $\min_h \frac{1}{m}\sum_i \exp(-y_i f(x_i)) = \prod_t Z_t$, with $f(x) = \sum_t \alpha_t h_t(x)$.
Comparing $\alpha_t$ with $w_d$ and $h_t(x)$ with $x_d$, LR and Boosting become comparable. In LR all $w_d$ are learned jointly, while in Boosting the $h_t$ (and $\alpha_t$) are learned sequentially. Note that LR is a linear classifier, but Boosting is not.
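The structural difference aside, the two objectives differ only in which surrogate loss they apply to the margin $y_i f(x_i)$; a tiny snippet (illustrative only) makes the comparison concrete:

import numpy as np

z = np.linspace(-2, 2, 5)                  # margins y_i * f(x_i)
logistic_loss = np.log(1 + np.exp(-z))     # LR surrogate loss
exp_loss = np.exp(-z)                      # Boosting surrogate loss
print(np.c_[z, logistic_loss, exp_loss])   # the exponential loss penalizes negative margins much more heavily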
Attention
A common mistake: even if you choose a linear classifier for $h_t$, the final classifier in Boosting is not linear. A sign function should be taken, i.e., $h_t(x) = \mathrm{sign}(w_t^T x)$; otherwise $f(x) = \sum_t \alpha_t h_t(x)$ would be a linear combination of linear functions, which is still linear, and that is not what we want.
Summary
We have learned to use a set of weak classifiers to build a strong classifier, which can reduce the training error arbitrarily close to zero, even if each weak classifier is only slightly better than a random guess.
A particular Boosting algorithm: AdaBoost.
Comparison of Logistic Regression and Boosting: linear vs. nonlinear; joint optimization vs. sequential optimization.