CS7267 Machine Learning
Ensemble Learning
Mingon Kang, Ph.D., Computer Science, Kennesaw State University
Ref: Dr. Ricardo Gutierrez-Osuna (TAMU) and Aarti Singh (CMU)
Definition of Ensemble Learning
Ensemble learning is a machine learning paradigm in which multiple learners are trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods construct a set of hypotheses and combine them.
Why ensemble learning?
Accuracy: a more reliable mapping can be obtained by combining the outputs of multiple experts
Efficiency: a complex problem can be decomposed into multiple subproblems that are easier to understand and solve (divide-and-conquer approach)
No single model works best for all ML problems
Bias-variance tradeoff
Simple (a.k.a. weak or base) learners, e.g., SVM, logistic regression, a simple discriminant function
Low variance: they don't usually overfit
High bias: they can't solve hard learning problems on their own
Can we always make weak learners good? No, but often yes.
Ensemble methods
Learn multiple weak learners that are good at different parts of the input space
Weak learners: homogeneous or heterogeneous; trained in parallel or sequential style
They should be as accurate as possible and as diverse as possible
How to combine the weak learners? Majority voting for classification; weighted averaging for regression (see the sketch below)
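A minimal sketch (not from the slides) of the two combination rules just mentioned. It assumes class labels are encoded as non-negative integers 0..C-1; the function names `majority_vote` and `weighted_average` are illustrative, not a standard API.

```python
import numpy as np

def majority_vote(predictions):
    """Combine hard labels from T classifiers (array of shape T x n_samples)
    by taking the most common label for each sample."""
    preds = np.asarray(predictions)
    # bincount assumes labels are non-negative integers
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), axis=0, arr=preds)

def weighted_average(predictions, weights):
    """Combine real-valued outputs from T regressors (T x n_samples)
    as a convex combination given by `weights` (length T)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so the weights sum to 1
    return w @ np.asarray(predictions)   # weighted average per sample
```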
Accuracy/diversity of learners
The accuracy of learners is easy to estimate. How about diversity? There is no rigorous definition for measuring it.
Diversity among the base learners can be introduced through different channels, such as:
Subsampling the training examples
Manipulating the attributes
Manipulating the outputs
Injecting randomness into the learning algorithms
Subsampling the training examples
Multiple hypotheses are generated by training individual classifiers on different datasets obtained by resampling a common training set
Manipulating the input features
Multiple hypotheses are generated by training individual classifiers on different representations, or different subsets, of a common feature vector (see the sketch below)
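A short sketch, under the assumption that `X` and `y` are NumPy arrays, of the two diversity channels above: bootstrap resampling of examples and random subsetting of feature columns. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """Resample the training set with replacement (same size as the original)."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def random_feature_subset(X, k):
    """Keep a random subset of k feature columns; return the data and
    the chosen column indices (needed again at prediction time)."""
    cols = rng.choice(X.shape[1], size=k, replace=False)
    return X[:, cols], cols
```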
Manipulating the output targets
The output targets for C classes are encoded with an L-bit codeword, and an individual classifier is built to predict each one of the bits in the codeword
Modifying the learning parameters of the classifier
A number of classifiers are built with different learning parameters, such as the number of neighbors in a k-Nearest Neighbor rule (see the sketch below)
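A sketch of the parameter-variation channel using scikit-learn's k-NN classifier: one learner per choice of k, combined by majority vote. The synthetic dataset and the particular k values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Diversity comes from the learning parameter alone, not from the data:
# each classifier sees the full training set but uses a different k.
learners = [KNeighborsClassifier(n_neighbors=k).fit(X, y)
            for k in (1, 3, 5, 7, 9)]
votes = np.array([m.predict(X) for m in learners])
ensemble_pred = np.apply_along_axis(
    lambda col: np.bincount(col).argmax(), 0, votes)
```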
Ensemble methods
Boosting: AdaBoost
Bagging: bootstrap sampling; Random Forests, a variant of bagging
Stacking: a number of first-level individual learners are generated from the training data set by different learning algorithms, and the individual learners are combined by a second-level learner (a sketch follows)
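A compact stacking sketch, assuming scikit-learn. The choice of base learners (a decision tree and an SVM) and of logistic regression as the second-level learner is illustrative; out-of-fold predictions are used so the meta-learner is not trained on leaked labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# First level: heterogeneous learners; cross_val_predict gives
# out-of-fold probabilities to use as second-level features.
base = [DecisionTreeClassifier(max_depth=3, random_state=0),
        SVC(probability=True, random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, method='predict_proba')[:, 1]
    for m in base])

# Second level: a simple learner trained on the first level's outputs
meta = LogisticRegression().fit(meta_features, y_tr)

# At test time, refit each base learner on all training data
test_features = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base])
print("stacked accuracy:", meta.score(test_features, y_te))
```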
Boosting [Schapire 89]
Output class: weighted vote of the individual learners
Let $h_t(x)$ be the output of the $t$-th classifier, which learns about a different part of the input space
The decision $H(x)$ is made by a weighted linear combination of the $h_t(x)$
Boosting [Schapire 89]
Given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
On each iteration t:
- Weight each training example by how incorrectly it was classified
- Learn a weak hypothesis: $h_t$
- Assign a strength to this hypothesis: $\alpha_t$
Final classifier: $H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$
Learning from weighted data
Consider a weighted dataset: $D(i)$ is the weight of the $i$-th training example $(x_i, y_i)$
When resampling the data, we may draw more copies of the heavily weighted data points (see the sketch below)
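A one-function sketch of resampling according to example weights; it assumes `D` is a NumPy array of weights that sums to 1, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_by_weight(X, y, D):
    """Draw a new training set in which example i appears with
    probability D[i] per draw (D must be non-negative and sum to 1)."""
    idx = rng.choice(len(X), size=len(X), p=D)
    return X[idx], y[idx]
```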
AdaBoost [Freund & Schapire 95]
Given: $(x_1, y_1), \ldots, (x_n, y_n)$ where $y_i \in \{-1, +1\}$
Initialize $D_1(i) = 1/n$ // initially equal weights
For $t = 1, \ldots, T$:
- Train a weak learner using distribution $D_t$
- Get a weak classifier $h_t : X \to \mathbb{R}$
- Choose $\alpha_t \in \mathbb{R}$
- Update:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$$ // increase weight if wrong
AdaBoost [Freund & Schapire 95]
$Z_t$ is a normalization factor: $Z_t = \sum_{i=1}^{n} D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i))$
Final classifier: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$ (a runnable sketch follows)
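A minimal runnable sketch of the algorithm above, using scikit-learn decision stumps as the weak learners (an assumption; the slides do not fix a base learner). Labels must be in {-1, +1}, and the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with depth-1 decision trees (stumps); y in {-1, +1}."""
    n = len(X)
    D = np.full(n, 1.0 / n)                  # D_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()             # weighted training error
        if eps >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * y * pred)    # increase weight if wrong
        D /= D.sum()                         # divide by Z_t to renormalize
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    f = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(f)
```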
How to choose $\alpha_t$?
Weight update rule: $D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$
Weighted training error: $\epsilon_t = \sum_{i=1}^{n} D_t(i)\,\delta(h_t(x_i) \neq y_i)$
Choose $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
$\epsilon_t = 0$: $h_t$ perfectly classifies all weighted data ($\alpha_t = \infty$)
$\epsilon_t = 1$: $h_t$ is perfectly wrong ($\alpha_t = -\infty$)
$\epsilon_t = 0.5$: $h_t$ is no better than chance ($\alpha_t = 0$)
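As a worked instance of the formula above: a weak learner with weighted error $\epsilon_t = 0.1$ receives $\alpha_t = \frac{1}{2}\ln\frac{0.9}{0.1} = \ln 3 \approx 1.10$, so before renormalization each misclassified example's weight is multiplied by $e^{\alpha_t} = 3$ and each correctly classified example's weight by $e^{-\alpha_t} = 1/3$.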
Boosting Example (two figure slides illustrating iterative reweighting; images not preserved)
Hard & Soft Decision
Weighted average of weak learners: $f(x) = \sum_t \alpha_t h_t(x)$
Hard decision / predicted label: $H(x) = \mathrm{sign}(f(x))$
Soft decision, based on an analogy with logistic regression: $P(Y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))}$ (see the sketch below)
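A short sketch turning ensemble scores into both decisions. It assumes the `stumps` and `alphas` produced by the earlier AdaBoost sketch; those names are illustrative, not part of any library.

```python
import numpy as np

def ensemble_scores(stumps, alphas, X):
    """f(x) = sum_t alpha_t h_t(x), from the AdaBoost sketch above."""
    return sum(a * h.predict(X) for h, a in zip(stumps, alphas))

def hard_decision(f):
    return np.sign(f)                  # predicted label in {-1, +1}

def soft_decision(f):
    return 1.0 / (1.0 + np.exp(-f))   # logistic estimate of P(Y=1|x)
```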
Effect of Outliers
Boosting can identify outliers, since it focuses on examples that are hard to classify
Too many outliers can dramatically degrade classification performance and increase the time to convergence
Bagging [Breiman 96]
Run independent weak learners on bootstrap replicates (sampled with replacement) of the training set
Average or vote over the weak hypotheses (see the sketch below)
Differences from Boosting:
Bagging: resamples the data; every classifier gets the same weight; reduces variance only
Boosting: reweights the data; each classifier's weight depends on its accuracy; reduces both bias and variance
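A minimal bagging sketch, assuming scikit-learn decision trees as the base learners and integer class labels 0..C-1; the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, T=25):
    """Train T independent trees, each on its own bootstrap replicate."""
    models = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Equal-weight majority vote over the T trees."""
    votes = np.array([m.predict(X) for m in models])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```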