Boosting & Deep Learning

Ensemble Learning
- So far: learning methods that learn a single hypothesis, chosen from a hypothesis space, that is used to make predictions
- Ensemble learning → select a collection (ensemble) of hypotheses and combine their predictions
- Example
  - Generate 100 different decision trees from the same or different training sets and have them vote on the best classification for a new example.
- Key motivation
  - Reduce the error rate. The hope is that it becomes much less likely that the ensemble will misclassify an example.
Ensemble Learning
- The generalization ability of an ensemble is usually significantly better than that of an individual learner
- Boosting is one of the most important families of ensemble methods

Learning Ensembles
- Learn multiple alternative definitions of a concept using different training data or different learning algorithms.
[Diagram: Training Data → Data 1, Data 2, ..., Data m → Learner 1, Learner 2, ..., Learner m → Model 1, Model 2, ..., Model m → Model Combiner → Final Model]
Value of Ensembles
- No Free Lunch Theorem
  - No single algorithm wins all the time!
- When combining multiple independent and diverse decisions, each of which is more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
- Examples: human ensembles are demonstrably better
  - How many jelly beans in the jar? Individual estimates vs. group average.

Example: Weather Forecast
[Figure: weather forecast grid with rows labeled Reality, 1 to 5, and Combine, marked with X's]
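The error-cancelling argument on the "Value of Ensembles" slide can be checked numerically. The sketch below is an illustration, not part of the original slides: it assumes an odd number of voters that err independently with the same probability, which real classifiers only approximate. The function name majority_vote_error is hypothetical.

```python
from math import comb


def majority_vote_error(n_voters: int, p_error: float) -> float:
    """Probability that a strict majority of n independent voters is wrong.

    Assumes an odd number of voters (no ties) and that every voter errs
    independently with the same probability p_error.
    """
    k_min = n_voters // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_voters, k) * p_error**k * (1 - p_error) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


# Each voter is wrong 30% of the time; the ensemble error shrinks as n grows.
for n in (1, 5, 21, 101):
    print(n, majority_vote_error(n, 0.3))
```

With five such voters the majority is wrong about 16.3% of the time instead of 30%. If the individual error rate is above 0.5, the same formula shows the vote getting worse, which is why each member must beat random guessing.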
Ensemble Learning
- Ensemble learning:
  - A way of enlarging the hypothesis space, i.e., the ensemble itself is a hypothesis, and the new hypothesis space is the set of all possible ensembles constructible from hypotheses of the original space
[Figure: three linear threshold hypotheses (positive examples on the non-shaded side). The ensemble classifies as positive any example classified positively by all three. The resulting triangular region hypothesis is not expressible in the original hypothesis space.]

Different Learners
- Different learning algorithms
- Algorithms with different choices of parameters
- Data sets with different features
- Data sets made of different subsets
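The triangular-region example from the "Ensemble Learning" slide can be sketched in a few lines. The three half-planes below are chosen for illustration (they bound the triangle with corners (0,0), (1,0), (0,1)); the slide does not give specific coefficients, and the helper name linear_threshold is hypothetical.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Hypothesis = Callable[[Point], bool]


def linear_threshold(w1: float, w2: float, b: float) -> Hypothesis:
    """Linear threshold hypothesis: positive iff w1*x + w2*y + b >= 0."""
    def h(p: Point) -> bool:
        return w1 * p[0] + w2 * p[1] + b >= 0
    return h


# Illustrative half-planes: x >= 0, y >= 0, x + y <= 1.
hypotheses: List[Hypothesis] = [
    linear_threshold(1, 0, 0),
    linear_threshold(0, 1, 0),
    linear_threshold(-1, -1, 1),
]


def ensemble(p: Point) -> bool:
    """Positive only if all three member hypotheses say positive."""
    return all(h(p) for h in hypotheses)


print(ensemble((0.2, 0.2)))  # inside the triangle
print(ensemble((1.0, 1.0)))  # accepted by two half-planes, rejected by the third
```

No single linear threshold can carve out a bounded triangle, so the combined classifier lies outside the original hypothesis space.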
Homogeneous Ensembles
- Use a single, arbitrary learning algorithm but manipulate the training data to make it learn multiple models.
  - Data 1 ≠ Data 2 ≠ ... ≠ Data m
  - Learner 1 = Learner 2 = ... = Learner m
- Different methods for changing the training data:
  - Bagging: resample the training data
  - Boosting: reweight the training data

Bagging
- Create ensembles by repeatedly and randomly resampling the training data (Breiman, 1996)
- Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement
  - Each bootstrap sample will contain on average 63.2% of the unique training examples; the rest are replicates
- Combine the m resulting models using a simple majority vote
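The 63.2% figure on the "Bagging" slide follows from the chance that a given example is never drawn in n draws with replacement, (1 - 1/n)^n, which approaches 1/e for large n. The sketch below compares that formula with a small simulation; the function names and the seed are illustrative choices.

```python
import math
import random


def expected_unique_fraction(n: int) -> float:
    """Expected fraction of distinct examples in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n


def simulated_unique_fraction(n: int, trials: int, seed: int = 0) -> float:
    """Average fraction of distinct indices over several bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        total += len(set(sample)) / n
    return total / trials


print(expected_unique_fraction(1000))        # close to 1 - 1/e
print(1 - math.exp(-1))                      # limiting value, about 0.632
print(simulated_unique_fraction(1000, 200))  # empirical estimate
```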
Bagging
- Decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed

Bagging (Bootstrap Aggregating)
Given a standard training set D of size n
For i = 1 .. M
    Draw a sample of size n* < n from D uniformly and with replacement
    Learn classifier C_i
Final classifier is a vote of C_1 .. C_M
Increases classifier stability / reduces variance
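The bagging loop above can be sketched as follows. This is a minimal illustration under stated assumptions: the base learner learn_stump (a one-feature threshold classifier) and the toy data are hypothetical stand-ins, since the slides leave the base learner open.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]          # (feature value, label in {0, 1})
Classifier = Callable[[float], int]


def learn_stump(data: Sequence[Example]) -> Classifier:
    """Hypothetical base learner: the single threshold with the fewest errors."""
    best = None  # (errors, threshold, label predicted at or above the threshold)
    for cand, _ in data:
        for label_above in (0, 1):
            errors = sum(
                (label_above if x >= cand else 1 - label_above) != y
                for x, y in data
            )
            if best is None or errors < best[0]:
                best = (errors, cand, label_above)
    _, threshold, label_above = best
    return lambda x: label_above if x >= threshold else 1 - label_above


def bagging(data: Sequence[Example], learn: Callable[[Sequence[Example]], Classifier],
            n_models: int, sample_size: int, seed: int = 0) -> Classifier:
    """Train n_models classifiers on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models: List[Classifier] = []
    for _ in range(n_models):
        # Draw uniformly and with replacement from the training set.
        sample = [rng.choice(data) for _ in range(sample_size)]
        models.append(learn(sample))

    def vote(x: float) -> int:
        return Counter(m(x) for m in models).most_common(1)[0][0]

    return vote


# Toy data: label is 1 when the feature is at least 0.5.
data = [(i / 10, int(i >= 5)) for i in range(10)]
clf = bagging(data, learn_stump, n_models=25, sample_size=8)
print(clf(0.0), clf(0.9))
```

An odd number of models avoids ties in the two-class vote.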
Strong and Weak Learners
- Strong learner:
  - The objective of machine learning
  - Takes labeled data for training
  - Produces a classifier which can be arbitrarily accurate
- Weak learner:
  - Takes labeled data for training
  - Produces a classifier which is more accurate than random guessing

Boosting
- Weak learner: only needs to generate a hypothesis with a training accuracy greater than 0.5, i.e., < 50% error over any distribution
- Learners
  - Strong learners are very difficult to construct
  - Constructing weaker learners is relatively easy
- Question: can a set of weak learners create a single strong learner?
  - Yes: boost weak classifiers to a strong learner
Boosting
- Originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990).
- Revised to be a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).

Boosting
- Examples are given weights
- At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong
Boosting
- Instead of sampling (as in bagging), reweight the examples!
- Examples are given weights.
- At each iteration, a new hypothesis is learned (weak learner) and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.
- The final classification is based on a weighted vote of the weak classifiers

Adaptive Boosting
[Figure: each rectangle corresponds to an example, with weight proportional to its height. Crosses correspond to misclassified examples. The size of each decision tree indicates the weight of that hypothesis in the final ensemble.]
Construct Weak Classifiers
- Using different data distributions
  - Start with uniform weighting
  - During each step of learning
    - Increase the weights of the examples which are not correctly learned by the weak learner
    - Decrease the weights of the examples which are correctly learned by the weak learner
- Idea
  - Focus on difficult examples which were not correctly classified in the previous steps

Combine Weak Classifiers
- Weighted voting
  - Construct a strong classifier by weighted voting of the weak classifiers
- Idea
  - A better weak classifier gets a larger weight
  - Iteratively add weak classifiers
    - Increase the accuracy of the combined classifier through minimization of a cost function
Adaptive Boosting: High-Level Description
C = 0;  /* counter */
M = m;  /* number of hypotheses to generate */
1  Set the same weight for all the examples (typically each example has weight = 1);
2  While (C < M)
   2.1  Increase counter C by 1.
   2.2  Generate hypothesis h_C.
   2.3  Increase the weight of the examples misclassified by hypothesis h_C.
3  Weighted majority combination of all M hypotheses (weights according to how well each performed on the training set).
- Many variants, depending on how the weights are set and how the hypotheses are combined.
- AdaBoost → quite popular!
Boosting: Basic Algorithm
- General loop:
    Set all examples to have equal uniform weights.
    For t from 1 to T do:
        Learn a hypothesis, h_t, from the weighted examples
        Decrease the weights of examples h_t classifies correctly
- The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting.
- During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.

AdaBoost Pseudocode
TrainAdaBoost(D, BaseLearn) {
    For each example d_i in D let its weight w_i = 1/|D|
    Let H be an empty set of hypotheses
    For t from 1 to T do: {
        Learn a hypothesis, h_t, from the weighted examples: h_t = BaseLearn(D)
        Calculate the error, ε_t, of the hypothesis h_t as the total sum weight of the examples that it classifies incorrectly.
        If ε_t > 0.5 then set T := t-1 and exit the loop
        Add h_t to H
        Let β_t = ε_t / (1 - ε_t)
            (other formulations use α_t = ½ ln((1 - ε_t) / ε_t) as the hypothesis weight instead)
        Multiply the weights of the examples that h_t classifies correctly by β_t
        Rescale the weights of all of the examples so the total sum weight remains 1.
    }
    Return H
}
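One round of the TrainAdaBoost weight update can be traced on a tiny example. The sketch below assumes five examples with uniform weights, one of which the current hypothesis gets wrong; the function name reweight and the example data are illustrative. It also assumes 0 < ε_t < 0.5, since ε_t = 0 would make every weight zero before rescaling.

```python
from typing import List, Sequence, Tuple


def reweight(weights: Sequence[float],
             correct: Sequence[bool]) -> Tuple[float, float, List[float]]:
    """One AdaBoost round: return (epsilon, beta, rescaled weights)."""
    # Error = total weight of the misclassified examples.
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    beta = eps / (1 - eps)
    # Shrink the weights of correctly classified examples by beta.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    # Rescale so the weights sum to 1 again.
    return eps, beta, [w / total for w in scaled]


weights = [0.2] * 5
correct = [True, True, True, True, False]
eps, beta, new_weights = reweight(weights, correct)
print(eps, beta, new_weights)
```

Here ε is about 0.2 and β about 0.25, so each correctly classified example drops to roughly 0.125 and the misclassified one rises to roughly 0.5. After rescaling, the misclassified examples always carry half the total weight, which forces the next hypothesis to attend to them.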
AdaBoost Pseudocode
TestAdaBoost(ex, H) {
    Let each hypothesis, h_t, in H vote for ex's classification with weight log(1/β_t)
    Return the class with the highest weighted vote total.
}

Performance of AdaBoost
- Learner = Hypothesis = Classifier
- Weak learner: < 50% error over any distribution
- M: number of hypotheses in the ensemble.
- If the input learner is a weak learner, then AdaBoost will return a hypothesis that classifies the training data perfectly for a large enough M, boosting the accuracy of the original learning algorithm on the training data.
- Strong classifier: thresholded linear combination of weak learner outputs.
Experimental Results on Ensembles (Freund & Schapire, 1996; Quinlan, 1996)
- Ensembles have been used to improve generalization accuracy on a wide variety of problems.
- On average, boosting provides a larger increase in accuracy than bagging.
- Boosting on rare occasions can degrade accuracy.
- Boosting is particularly subject to over-fitting when there is significant noise in the training data.

Deep Network Training
- Use unsupervised learning (greedy layer-wise training)
  - Allows abstraction to develop naturally from one layer to another
  - Helps the network initialize with good parameters
- Perform supervised top-down training as a final step
  - Refine the features (intermediate layers) so that they become more relevant for the task
- Deep architectures can be representationally efficient
  - Fewer computational units for the same function
- Deep representations might allow for a hierarchy
  - Comprehensibility
- Multiple levels of latent variables allow combinatorial sharing of statistical strength

Different Levels of Abstraction
- Hierarchical learning
  - Natural progression from low-level to high-level structure, as seen in natural complexity
  - Easier to monitor what is being learnt and to guide the machine to better subspaces
  - A good lower-level representation can be used for many distinct tasks
23/11/15

Hierarchy of representations with increasing level of abstraction
- Each stage is a kind of trainable feature transform
- Image recognition
  - Pixel → edge → texton → motif → part → object
- Text
  - Character → word → word group → clause → sentence → story
- Speech
  - Sample → spectral band → sound → ... → phone → phoneme → word
Receptive Fields in Retinal Bipolar and Ganglion Cells
Receptive Fields of Lateral Geniculate and Primary Visual Cortex
Face Cells in Monkey

Computational Model of Object Recognition (Riesenhuber and Poggio, 1999)
- Image passed through layers of units with progressively more complex features at progressively less specific locations.
- Hierarchical in that features at one stage are built from features at earlier stages
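The idea of features at "progressively less specific locations" can be illustrated with one template-matching stage followed by one max-pooling stage. This is a simplified one-dimensional sketch, not the model from the slides: the function names, the template, and the pooling width are all illustrative assumptions.

```python
from typing import List, Sequence


def match_layer(signal: Sequence[float], template: Sequence[float]) -> List[float]:
    """Template matching: the response at each position is a dot product."""
    width = len(template)
    return [
        sum(t * signal[i + k] for k, t in enumerate(template))
        for i in range(len(signal) - width + 1)
    ]


def pool_layer(responses: Sequence[float], pool_size: int) -> List[float]:
    """Max pooling over non-overlapping windows: keeps the feature strength
    but discards its exact position inside the window."""
    return [
        max(responses[i:i + pool_size])
        for i in range(0, len(responses), pool_size)
    ]


template = [1, 1]
pattern = [0, 1, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 1, 1, 0, 0, 0, 0]   # same pattern moved one step to the right

s_a, s_b = match_layer(pattern, template), match_layer(shifted, template)
c_a, c_b = pool_layer(s_a, 4), pool_layer(s_b, 4)
print(s_a, s_b)   # matching responses differ between the two inputs
print(c_a, c_b)   # pooled responses agree
```

The matching stage responds at different positions for the two inputs, while the pooled stage gives the same output for both, so the next stage sees the feature regardless of the small shift.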
Hierarchical Template Matching: Fukushima & Miyake's (1982) Neocognitron
First S-layer after learning

Second S-layer
Third S-layer

Fourth S-cell layer
- Example of the features recognized by the neurons in the 3 stages of a neocognitron implementation.
[Figure labels: U0, Us1, Uc1, Us2, Uc2, Us3, Uc3]