Boosting & Deep Learning


Ensemble Learning
- So far: learning methods that learn a single hypothesis, chosen from a hypothesis space, which is then used to make predictions.
- Ensemble learning: select a collection (ensemble) of hypotheses and combine their predictions.
- Example: generate 100 different decision trees from the same or different training sets and have them vote on the best classification for a new example.
- Key motivation: reduce the error rate. The hope is that it becomes much more unlikely that the ensemble will misclassify an example.

Ensemble Learning
- The generalization ability of the ensemble is usually significantly better than that of an individual learner.
- Boosting is one of the most important families of ensemble methods.

Learning Ensembles
- Learn multiple alternative definitions of a concept using different training data or different learning algorithms.
- [Diagram: Training Data is split into Data1, Data2, ..., Data m; these are fed to Learner1, Learner2, ..., Learner m, producing Model1, Model2, ..., Model m, which a Model Combiner merges into the Final Model.]

Value of Ensembles
- No Free Lunch Theorem: no single algorithm wins all the time!
- When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
- Example: human ensembles are demonstrably better.
  - How many jelly beans in the jar? Individual estimates vs. the group average.

Example: Weather Forecast
- [Figure: daily forecasts from sources 1-5 and their combination, compared against Reality; X marks indicate errors.]

Ensemble Learning
- Ensemble learning is a way of enlarging the hypothesis space: the ensemble itself is a hypothesis, and the new hypothesis space is the set of all possible ensembles constructible from hypotheses of the original space.
- [Figure: three linear threshold hypotheses (positive examples on the non-shaded side); the ensemble classifies as positive any example classified positively by all three. The resulting triangular region hypothesis is not expressible in the original hypothesis space.]

Different Learners
- Different learning algorithms
- Algorithms with different choices of parameters
- Data sets with different features
- Data sets = different subsets

Homogeneous Ensembles
- Use a single, arbitrary learning algorithm, but manipulate the training data to make it learn multiple models.
  - Data1 ≠ Data2 ≠ ... ≠ Data m
  - Learner1 = Learner2 = ... = Learner m
- Different methods for changing the training data:
  - Bagging: resample the training data
  - Boosting: reweight the training data

Bagging
- Create ensembles by repeatedly, randomly resampling the training data (Breiman, 1996).
- Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement.
- Each bootstrap sample will on average contain about 63.2% of the unique training examples; the rest are replicates (see the derivation below).
- Combine the m resulting models using a simple majority vote.
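Where the 63.2% figure comes from (a short derivation, not on the original slide): each of the n draws picks uniformly, with replacement, from the n examples, so

\[
\Pr(\text{a given example is never drawn}) = \Bigl(1 - \tfrac{1}{n}\Bigr)^{n} \;\approx\; e^{-1} \approx 0.368,
\qquad
\Pr(\text{drawn at least once}) = 1 - \Bigl(1 - \tfrac{1}{n}\Bigr)^{n} \approx 0.632 .
\]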

Bagging
- Combine the m resulting models using a simple majority vote.
- Decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed.

Bagging (Aggregate Bootstrapping)
  Given a standard training set D of size n
  For i = 1 .. M:
      Draw a sample of size n* < n from D uniformly and with replacement
      Learn classifier C_i
  The final classifier is a vote of C_1 .. C_M
- Increases classifier stability / reduces variance (a Python sketch follows below).
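A minimal Python sketch of the bagging procedure above, assuming scikit-learn decision trees as the base learner; the function names are illustrative, and bootstrap samples of size n are drawn (the boxed pseudocode's n* < n variant would just change the size argument):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_train(X, y, M=25, seed=0):
        """Train M classifiers, each on a bootstrap sample of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        models = []
        for _ in range(M):
            idx = rng.choice(n, size=n, replace=True)   # draw n examples with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Final classifier: simple majority vote over the M models (integer class labels)."""
        votes = np.stack([m.predict(X) for m in models]).astype(int)   # shape (M, n_samples)
        return np.array([np.bincount(votes[:, j]).argmax() for j in range(votes.shape[1])])

Because each tree sees a slightly different bootstrap sample, the vote averages away much of the variance of the individual trees, which is exactly the effect described on the slide.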

Strong and Weak Learners
- Strong learner:
  - The objective of machine learning.
  - Takes labeled data for training.
  - Produces a classifier which can be arbitrarily accurate.
- Weak learner:
  - Takes labeled data for training.
  - Produces a classifier which is more accurate than random guessing.

Boosting
- Weak learner: only needs to generate a hypothesis with a training accuracy greater than 0.5, i.e., less than 50% error over any distribution.
- Learners:
  - Strong learners are very difficult to construct.
  - Constructing weak learners is relatively easy.
- Question: can a set of weak learners create a single strong learner?
  - Yes: boost the weak classifiers into a strong learner.

Boosting
- Originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990).
- Revised to be a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).

Boosting
- Examples are given weights.
- At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on the examples that the most recently learned classifier got wrong.

Boosting
- Revised to be a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).
- Instead of sampling (as in bagging), re-weight the examples!
- Examples are given weights.
- At each iteration, a new hypothesis is learned (weak learner) and the examples are reweighted to focus the system on the examples that the most recently learned classifier got wrong.
- The final classification is based on a weighted vote of the weak classifiers.

Adaptive Boosting
- [Figure: each rectangle corresponds to an example, with weight proportional to its height; crosses correspond to misclassified examples. The size of each decision tree indicates the weight of that hypothesis in the final ensemble.]

Construct Weak Classifiers
- Using different data distributions:
  - Start with uniform weighting.
  - During each step of learning:
    - Increase the weights of the examples which are not correctly learned by the weak learner.
    - Decrease the weights of the examples which are correctly learned by the weak learner.
- Idea: focus on the difficult examples which were not correctly classified in the previous steps.

Combine Weak Classifiers
- Weighted voting:
  - Construct the strong classifier by a weighted vote of the weak classifiers (see the formulas below).
- Idea:
  - A better weak classifier gets a larger weight.
  - Iteratively add weak classifiers; increase the accuracy of the combined classifier through minimization of a cost function.
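In symbols, using the β_t defined in the AdaBoost pseudocode below (this rendering is a sketch, not taken from the slides):

\[
w_i \leftarrow w_i \cdot \beta_t \;\text{ if } h_t(x_i) = y_i \quad (\text{then renormalize so } \textstyle\sum_i w_i = 1),
\qquad
H(x) = \arg\max_{y} \sum_{t \,:\, h_t(x) = y} \log \tfrac{1}{\beta_t}.
\]

Since β_t < 1 whenever the weak learner beats chance, correctly classified examples lose weight (so misclassified ones gain relative weight), and more accurate hypotheses receive larger votes.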

Adaptive Boosting: High-Level Description

  C = 0;    /* counter */
  M = m;    /* number of hypotheses to generate */
  1. Set the same weight for all the examples (typically each example has weight = 1).
  2. While (C < M):
     2.1 Increase the counter C by 1.
     2.2 Generate hypothesis h_C.
     2.3 Increase the weight of the examples misclassified by hypothesis h_C.
  3. Weighted majority combination of all M hypotheses (weights according to how well each performed on the training set).

- Many variants, depending on how the weights are set and how the hypotheses are combined.
- AdaBoost is quite popular!

Boosting: Basic Algorithm
- General loop:
    Set all examples to have equal, uniform weights.
    For t from 1 to T:
        Learn a hypothesis, h_t, from the weighted examples.
        Decrease the weights of the examples h_t classifies correctly.
- The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting.
- During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.

AdaBoost Pseudocode (a Python sketch follows below)

  TrainAdaBoost(D, BaseLearn):
      For each example d_i in D, let its weight w_i = 1/|D|
      Let H be an empty set of hypotheses
      For t from 1 to T:
          Learn a hypothesis, h_t, from the weighted examples: h_t = BaseLearn(D)
          Add h_t to H
          Calculate the error, ε_t, of h_t as the total sum of the weights of the examples it classifies incorrectly
          If ε_t > 0.5, then set T := t - 1 and exit the loop
          Let β_t = ε_t / (1 - ε_t)
          (some presentations equivalently use the hypothesis weight α_t = ½ log((1 - ε_t) / ε_t))
          Multiply the weights of the examples that h_t classifies correctly by β_t
          Rescale the weights of all of the examples so that the total sum of the weights remains 1
      Return H
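A compact Python sketch of TrainAdaBoost as written above, assuming scikit-learn decision stumps as BaseLearn (names such as adaboost_train are illustrative, not from the slides):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_train(X, y, T=50):
        """TrainAdaBoost: returns the hypotheses h_t and their beta_t values."""
        n = len(X)
        w = np.full(n, 1.0 / n)                      # w_i = 1/|D|
        hypotheses, betas = [], []
        for _ in range(T):
            # BaseLearn on the weighted examples: a depth-1 tree, i.e. a decision stump
            h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            correct = (h.predict(X) == y)
            eps = w[~correct].sum()                  # weighted training error of h_t
            if eps > 0.5:                            # weak-learner assumption violated: stop
                break
            beta = max(eps, 1e-12) / (1.0 - eps)     # beta_t = eps_t / (1 - eps_t)
            w[correct] *= beta                       # shrink weights of correctly classified examples
            w /= w.sum()                             # rescale so the weights sum to 1
            hypotheses.append(h)
            betas.append(beta)
        return hypotheses, betas

The early break plays the role of the "T := t - 1" step: a hypothesis with weighted error above 0.5 is discarded and boosting stops.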

AdaBoost Pseudocode

  TestAdaBoost(ex, H):
      Let each hypothesis, h_t, in H vote for ex's classification with weight log(1/β_t)
      Return the class with the highest weighted vote total.

Performance of AdaBoost
- Learner = hypothesis = classifier.
- Weak learner: less than 50% error over any distribution.
- M: the number of hypotheses in the ensemble.
- If the input learner is a weak learner, then AdaBoost will return a hypothesis that classifies the training data perfectly for a large enough M, boosting the accuracy of the original learning algorithm on the training data.
- Strong classifier: a thresholded linear combination of the weak learners' outputs.
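The matching test-time vote, pairing with the adaboost_train sketch above (again an illustrative sketch; integer class labels are assumed):

    import numpy as np

    def adaboost_predict(hypotheses, betas, X):
        """TestAdaBoost: each h_t votes for its predicted class with weight log(1/beta_t)."""
        totals = [dict() for _ in range(len(X))]     # per-example weighted vote totals
        for h, beta in zip(hypotheses, betas):
            weight = np.log(1.0 / beta)
            for i, c in enumerate(h.predict(X)):
                totals[i][c] = totals[i].get(c, 0.0) + weight
        # return, for each example, the class with the highest weighted vote total
        return np.array([max(t, key=t.get) for t in totals])

    # Example usage (X_train, y_train, X_test are assumed to exist):
    #   H, betas = adaboost_train(X_train, y_train, T=50)
    #   y_pred = adaboost_predict(H, betas, X_test)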

Experimental Results on Ensembles (Freund & Schapire, 1996; Quinlan, 1996)
- Ensembles have been used to improve generalization accuracy on a wide variety of problems.
- On average, boosting provides a larger increase in accuracy than bagging.
- Boosting can, on rare occasions, degrade accuracy.
- Boosting is particularly subject to over-fitting when there is significant noise in the training data.

Deep Network Training
- Use unsupervised learning (greedy layer-wise training), as sketched below:
  - Allows abstraction to develop naturally from one layer to another.
  - Helps the network initialize with good parameters.
- Perform supervised top-down training as the final step:
  - Refine the features (intermediate layers) so that they become more relevant for the task.
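A minimal PyTorch sketch of this two-phase recipe: each layer is first pretrained as an autoencoder on the representation produced by the layers below it (the unsupervised, greedy layer-wise phase), and the stacked encoder is then refined together with a classifier head (the supervised top-down phase). All function names, layer sizes, and training settings here are illustrative assumptions, not from the slides:

    import torch
    import torch.nn as nn

    def pretrain_layer(inputs, in_dim, hidden_dim, epochs=10, lr=1e-3):
        """Unsupervised step: train one encoder layer as an autoencoder on `inputs`."""
        encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        decoder = nn.Linear(hidden_dim, in_dim)
        opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(decoder(encoder(inputs)), inputs)
            loss.backward()
            opt.step()
        return encoder

    def greedy_layerwise_pretrain(X, layer_dims):
        """Stack encoders, each pretrained on the representation built so far."""
        encoders, rep = [], X
        for in_dim, out_dim in zip(layer_dims[:-1], layer_dims[1:]):
            enc = pretrain_layer(rep.detach(), in_dim, out_dim)
            encoders.append(enc)
            rep = enc(rep.detach())                  # abstraction for the next layer to build on
        return nn.Sequential(*encoders)

    def fine_tune(stack, X, y, n_classes, epochs=10, lr=1e-3):
        """Supervised top-down step: refine all layers plus a classifier head."""
        head = nn.Linear(stack[-1][0].out_features, n_classes)
        model = nn.Sequential(stack, head)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(X), y)
            loss.backward()
            opt.step()
        return model

Usage might look like model = fine_tune(greedy_layerwise_pretrain(X, [784, 256, 64]), X, y, n_classes=10) for flattened image inputs, with X a float tensor and y a tensor of integer class labels.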

- Deep architectures can be representationally efficient:
  - Fewer computational units for the same function.
- Deep representations might allow for a hierarchy:
  - Comprehensibility.
- Multiple levels of latent variables allow combinatorial sharing of statistical strength.

Different Levels of Abstraction
- Hierarchical learning:
  - Natural progression from low-level to high-level structure, as seen in natural complexity.
  - Easier to monitor what is being learnt and to guide the machine to better subspaces.
  - A good lower-level representation can be used for many distinct tasks.


Hierarchy of representations with increasing level of abstraction
- Each stage is a kind of trainable feature transform.
- Image recognition: pixel → edge → texton → motif → part → object
- Text: character → word → word group → clause → sentence → story
- Speech: sample → spectral band → sound → ... → phone → phoneme → word


[Figure slide: Receptive Fields in Retinal Bipolar and Ganglion Cells]

[Figure slide: Receptive Fields of the Lateral Geniculate and Primary Visual Cortex]


[Figure slide: Face Cells in the Monkey]

Computational Model of Object Recognition (Riesenhuber and Poggio, 1999)

- Image passed through layers of units with progressively more complex features at progressively less specific locations.
- Hierarchical, in that features at one stage are built from features at earlier stages.

Hierarchical Template Matching: Fukushima & Miyake's (1982) Neocognitron


[Figure slides: First S-layer after learning; Second S-layer]

[Figure slides: Third S-layer; Fourth S-cell layer]

- Example of features recognized by the neurons in the three stages of a neocognitron implementation.
- [Figure: layers U0, Us1, Uc1, Us2, Uc2, Us3, Uc3]