Ensemble Learning
Lecture 10, DD2431 Machine Learning
A. Maki, J. Sullivan, October 2014

Outline: Ensemble Learning
We will describe and investigate algorithms that train weak classifiers/regressors and combine them to construct a classifier/regressor more powerful than any of the individual ones. Such methods go under names like ensemble learning, committee machines, etc. Background/methods:
- Wisdom of Crowds
- Why combine classifiers?
- Bagging: static structure, parallel
- Boosting: static structure, serial (example: face detection)

The Wisdom of Crowds - Really?
Is the crowd wiser than any individual, even the smartest guy in the room? When? For which questions? See The Wisdom of Crowds by James Surowiecki, published in 2004, for this idea applied to business. The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual and can be harnessed by voting.

What makes a crowd wise?
Four elements are required to form a wise crowd (J. Surowiecki):
- Diversity of opinion. People in the crowd should have a range of experiences, education and opinions.
- Independence. A person's prediction is not influenced by the other people in the crowd.
- Decentralization. People have specializations and local knowledge.
- Aggregation. There is a mechanism for aggregating all predictions into one single prediction.
Crowd wisdom is best suited to problems that involve optimization, but ill suited to problems that require creativity or innovation (You Are Not a Gadget, Jaron Lanier).

Consider this scenario. Ask each person in a crowd: how much does the pig weigh? The answers are 80, 96, 103, 120, 90, 110, 85, 118, 97, 99, 75 and 125. The crowd's prediction is the AVERAGE of all individual predictions, so this crowd predicts that the pig weighs 99.83 kg. Has the crowd made a good estimate?

Suppose the crowd consists of 50 independent people, 30% experts and 70% non-experts, and that the pig's true weight is 100 kg. Model the predictions as P(pred. weight | expert) = N(100, 5^2) and P(pred. weight | non-expert) = U(70, 130). [Figure: histograms of P(predicted weight | expert), P(predicted weight | non-expert) and P(predicted weight | crowd) over the range 70-130 kg.] On average this crowd will make better estimates than the experts: it is wiser than each of the experts! A small simulation of this toy model is sketched below.
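The following is a minimal simulation sketch of the toy model above (50 people, 30% experts drawn from N(100, 5^2), 70% non-experts drawn from U(70, 130)); the trial count and random seed are arbitrary choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight, n_people, p_expert = 100.0, 50, 0.30
n_trials = 10_000

crowd_err, expert_err = [], []
for _ in range(n_trials):
    is_expert = rng.random(n_people) < p_expert
    preds = np.where(is_expert,
                     rng.normal(100, 5, n_people),    # experts     ~ N(100, 5^2)
                     rng.uniform(70, 130, n_people))  # non-experts ~ U(70, 130)
    crowd_err.append(abs(preds.mean() - true_weight))       # crowd = average
    expert_err.append(abs(rng.normal(100, 5) - true_weight)) # one lone expert

print(f"mean |error|, crowd average of 50: {np.mean(crowd_err):.2f} kg")
print(f"mean |error|, single expert      : {np.mean(expert_err):.2f} kg")
```

Averaging over the whole crowd shrinks the variance of the estimate below that of a single expert, which is exactly the claim made on the slide.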

But... Why not just ask a bunch of experts?
- A large enough crowd gives a high probability that a sufficient number of experts will be in the crowd (for any question).
- Random selection means we do not make a biased choice of experts.
- For some questions it may be hard to identify a diverse set of experts.

The crowd must be careful. In the analysis of the crowd it is implicitly assumed that each person is not concerned with the opinions of others, and that the non-experts predict completely random wrong answers which somewhat cancel each other out. However, there may be a systematic and consistent bias in the non-experts' predictions. And if the crowd does not contain sufficient experts, then truth by consensus, rather than fact, leads to "Wikiality"! (Term coined by Stephen Colbert in an episode of The Colbert Report in July 2006.)

Combining classifiers: back to machines. We will exploit wisdom-of-crowds ideas for specific tasks by combining classifier predictions, aiming to combine independent and diverse classifiers. But we will use labelled training data to identify the expert classifiers in the pool, to identify complementary classifiers, and to decide how best to combine them.

Ensemble Prediction: Voting
A diverse and complementary set of high-bias classifiers, each performing better than chance, combined by voting,

  f_V(x) = sign( Σ_{t=1}^{T} h_t(x) ),   h_t ∈ H,

where H is a family of possible weak classifier functions, can produce a classifier with low bias. [Figure: the green region is the true boundary; a high-bias classifier vs. a low-bias classifier.] Example: voting over oriented hyper-planes can define convex regions. Low model complexity (small number of degrees of freedom) means high bias; high model complexity (large number of degrees of freedom) means low bias. A small voting sketch is given below.

Ensemble Learning & Prediction
But how can we
- define a set of diverse and complementary high-bias classifiers with non-random performance?
- combine this set of high-bias classifiers to produce a low-bias classifier able to model a complex boundary (superior to plain voting)?
How? Exploit labelled training data: train different classifiers which focus on different subsets of the data, and use a weighted sum of these diversely trained classifiers. This approach allows simple high-bias classifiers to be combined to model very complex boundaries.
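As an illustration of the voting formula f_V(x) = sign( Σ_t h_t(x) ), here is a minimal sketch that votes over randomized axis-aligned decision stumps, one simple choice of high-bias family H; the toy dataset, the stump count and the use of sklearn are assumptions for the demo, not part of the lecture.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
y = 2 * y - 1                                  # labels in {-1, +1}

stumps = []
for t in range(25):
    # Randomized axis-aligned stumps give a pool of diverse high-bias voters.
    stump = DecisionTreeClassifier(max_depth=1, splitter="random",
                                   random_state=t).fit(X, y)
    stumps.append(stump)

# f_V(x) = sign( sum_{t=1}^{T} h_t(x) )
votes = np.sum([s.predict(X) for s in stumps], axis=0)
f_V = np.sign(votes)

print("single stump accuracy:", stumps[0].score(X, y))
print("voting accuracy      :", np.mean(f_V == y))
```

The combined boundary can only be richer than a single stump's if the voters actually differ; this is the diversity requirement carried over from the crowd discussion.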

Ensemble method: Bagging (Bootstrap Aggregating)
Use bootstrap replicates of the training set, obtained by sampling with replacement; on each replicate learn one model, and then combine all the models.

Binary classification example: estimate the true decision boundary with a decision tree trained from a labelled training set S_i (m = 200 training points). [Figure: the true decision boundary, a training data set S_i, and the estimated decision boundaries found using bootstrap replicates S_1, ..., S_10.] Note how the decision boundaries found from the replicates differ from the expected decision boundary of the decision tree classifier: this is the property of instability.

Bagging - Bootstrap Aggregating
Input: Training data S = {(x_1, y_1), ..., (x_m, y_m)} of inputs x_j ∈ R^d and their labels or real values y_j.
Iterate: for b = 1, ..., B
  1. Sample training examples, with replacement, m times from S to create S_b.
  2. Use this bootstrap sample S_b to estimate the regression or classification function f_b.
Output: The bagging estimate, for
  Regression:      f_bag(x) = (1/B) Σ_{b=1}^{B} f_b(x)
  Classification:  f_bag(x) = arg max_{1 ≤ k ≤ K} Σ_{b=1}^{B} Ind(f_b(x) = k)
Note: Ind(x) = 1 if x = TRUE, otherwise Ind(x) = 0. (A minimal implementation sketch follows this slide.)

High variance, low bias classifiers
High-variance classifiers produce differing decision boundaries which are highly dependent on the training data. Low-bias classifiers produce decision boundaries which on average are good approximations to the true decision boundary, e.g. decision trees. Ensemble predictions such as bagging, voting and averaging over diverse high-variance, low-bias classifiers reduce the variance of the ensemble classifier.

Apply bagging to the original example. [Figure: ground truth, the classifier trained from S_1, and the bagged classifier with B = 100.]
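A minimal bagging sketch under assumed conditions: a toy two-class dataset and sklearn decision trees as the high-variance, low-bias base classifier; B, the dataset and the train/test split are illustrative choices, not the lecture's example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

rng = np.random.default_rng(0)
B, m = 100, len(X_tr)

models = []
for b in range(B):
    idx = rng.integers(0, m, size=m)        # sample m times with replacement
    models.append(DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx]))

def f_bag(X_new):
    # Classification: f_bag(x) = argmax_k sum_b Ind(f_b(x) = k)  (majority vote)
    preds = np.stack([mdl.predict(X_new) for mdl in models])    # shape (B, n)
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, preds)

print("single tree test accuracy  :", models[0].score(X_te, y_te))
print("bagged (B=100) test accuracy:", np.mean(f_bag(X_te) == y_te))
```

The fully grown trees are unstable (high variance); averaging their votes over bootstrap replicates reduces that variance, which is the point of the slide.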

Bagging (cont.)
Bagging is a procedure to reduce the variance of our classifier when labelled training data is limited. If we bag a high-bias, low-variance classifier, such as oriented horizontal and vertical lines, we do not get any benefit: the bias of the bagged classifier may be only marginally less than that of the base classifiers. Note: bagging only produces good results for high-variance, low-bias classifiers. [Figure, applying bagging to the original example with this base classifier: ground truth, the classifier trained from S_1, and the bagged classifier with B = 100.]

Ensemble Method: Boosting
Boosting started from a question: can a set of weak learners create a single strong classifier, where a weak learner performs only slightly better than chance? (Kearns, 1988)
Loop:
- Apply the learner to weighted samples.
- Increase the weights of the misclassified examples.
Input: Training data S = {(x_1, y_1), ..., (x_m, y_m)} of inputs x_i and their labels y_i ∈ {-1, 1} (or real values), and H, a family of possible weak classifiers/regression functions.
Output: A strong classifier/regression function, a weighted sum of the weak classifiers h_t ∈ H, t = 1, ..., T, with confidence/reliability coefficients α_t:
  f_T(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )   or   f_T(x) = Σ_{t=1}^{T} α_t h_t(x).

Ensemble Method: Boosting - how? (Consider just the case of classification.)
The performance of the classifiers h_1, ..., h_t helps define h_{t+1}. Maintain a weight w_i^(t) for each training example in S; a large w_i^(t) means that x_i has greater influence on the choice of h_t. At iteration t, w_i^(t) is increased if x_i is wrongly classified by h_t and decreased if x_i is correctly classified by h_t. Remember: each h_t ∈ H.

Binary classification example. [Figure: the true decision boundary and the training data.] Here H is the set of all possible oriented vertical and horizontal lines. [Figure, rounds 1 and 2 of the example: the chosen weak classifier, the re-weighted training points w^(2), and the current strong classifier f_2(x); in round 1, ε_1 = 0.19 and α_1 = 1.45.]

AdaBoost Algorithm (Freund & Schapire, 1997)
Given: labelled training data S = {(x_1, y_1), ..., (x_m, y_m)} of inputs x_j ∈ R^d and their labels y_j ∈ {-1, 1}, and a set/class H of possible weak classifiers, to be combined over T rounds.
Initialize: introduce a weight w_j^(1) for each training example and set w_j^(1) = 1/m for each j.

AdaBoost Algorithm (cont.)
Iterate: for t = 1, ..., T
  1. Train the classifier h_t ∈ H, using S and the weights w_1^(t), ..., w_m^(t), which minimizes the weighted training error
       ε_t = Σ_{j=1}^{m} w_j^(t) Ind(y_j ≠ h_t(x_j)).
  2. Compute the reliability coefficient
       α_t = log( (1 - ε_t) / ε_t ).
     ε_t must be less than 0.5; break out of the loop if ε_t ≥ 0.5.
  3. Update the weights using
       w_j^(t+1) = w_j^(t) exp( -α_t y_j h_t(x_j) ).
  4. Normalize the weights so that they sum to 1.
(A minimal implementation sketch follows this slide.)

Properties of the Boosting algorithm
Training error: the training error goes to 0 exponentially fast.
Good generalization properties: one would expect over-fitting, but even when the training error vanishes the test error asymptotes rather than increasing. Why? Boosting tries to increase the margin of the training examples even when the training error is zero. With
  f_T(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ) = sign( φ_T(x) ),
the margin of a correctly classified example x_i is y_i φ_T(x_i). The larger the margin, the further the example is from the decision boundary, and the better the generalization ability.

Example: Viola & Jones Face Detection
An example of a classifier constructed using the boosting algorithm. Most state-of-the-art face detection on mobile phones, digital cameras etc. is based on this algorithm.

Viola & Jones: Training data
Positive training examples: image patches corresponding to faces, (x_i, 1). Negative training examples: random image patches from images not containing faces, (x_j, -1). Note: all patches are re-scaled to have the same size. [Figure: positive training examples.]
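Here is a minimal sketch of the algorithm above, with decision stumps as the weak classifier family H and a toy dataset; the dataset, T = 50 and the sklearn stump are assumed choices for the demo, while the α and weight-update formulae follow the slides.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
y = 2 * y - 1                          # labels in {-1, +1}
m = len(X)
w = np.full(m, 1.0 / m)                # w_j^(1) = 1/m

alphas, stumps = [], []
for t in range(50):                    # T = 50 rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    h = stump.predict(X)
    eps = np.sum(w * (h != y))         # weighted training error eps_t
    if eps <= 0 or eps >= 0.5:         # stop if perfect or no better than chance
        break
    alpha = np.log((1 - eps) / eps)    # reliability coefficient alpha_t
    w = w * np.exp(-alpha * y * h)     # up-weight mistakes, down-weight hits
    w /= w.sum()                       # normalize weights to sum to 1
    alphas.append(alpha)
    stumps.append(stump)

# Strong classifier f_T(x) = sign( sum_t alpha_t h_t(x) )
F = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("first stump training accuracy  :", np.mean(stumps[0].predict(X) == y))
print("boosted classifier training acc:", np.mean(F == y))
```

Running this, the boosted training error drops well below that of any single stump, consistent with the exponential training-error bound mentioned above.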

Viola & Jones: Weak classifier
Input: an image patch x. Apply a filter: f_j(x). Output: h(x) = (f_j(x) > θ), i.e. FACE or NON-FACE.

Viola & Jones: Filters considered
A huge library of possible Haar-like filters, f_1, ..., f_n, with n ≈ 16,000,000. The filters compute differences between sums of pixels in adjacent rectangles. (These can be computed very quickly using integral images.) Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001.

Viola & Jones: AdaBoost training
Recap: define the weak classifier as
  h_t(x) = 1 if f_{j_t}(x) > θ_t, and -1 otherwise.
Use AdaBoost to efficiently choose the best weak classifiers and to combine them. Remember: a weak classifier corresponds to a filter type and a threshold.
For t = 1, ..., T:
  For each filter type j:
    1. Apply the filter f_j to each example.
    2. Sort the examples by their filter responses.
    3. Select the best threshold for this classifier: θ_{tj}.
    4. Keep a record of the error of this classifier: ε_{tj}.
  Select the filter-threshold combination (weak classifier j*) with minimum error, and set j_t = j*, ε_t = ε_{tj*} and θ_t = θ_{tj*}.
  Re-weight the examples according to the AdaBoost formulae.
Note: there are many tricks to make this implementation more efficient. (A sketch of the sorted-response threshold search is shown below.)
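A hedged sketch of steps 2-3 above for a single filter: given the filter responses, the labels in {-1, +1} and the current example weights, one sweep over the sorted responses finds the threshold with minimum weighted error. It assumes distinct response values and a fixed polarity (predict +1 above the threshold); the names responses/labels/weights and the toy data are illustrative, not the actual Viola-Jones code.

```python
import numpy as np

def best_threshold(responses, labels, weights):
    """Return (theta, weighted error) minimizing error of h(x) = +1 if f(x) > theta."""
    order = np.argsort(responses)
    r, y, w = responses[order], labels[order], weights[order]

    # Threshold below all responses: everything predicted +1,
    # so the error is the total weight of the negative examples.
    err = np.sum(w[y == -1])
    best_err, best_theta = err, r[0] - 1.0

    for i in range(len(r)):
        # Raising the threshold to r[i] additionally flips example i to -1:
        # a positive example becomes an error, a negative one stops being one.
        err += w[i] if y[i] == 1 else -w[i]
        if err < best_err:
            best_err, best_theta = err, r[i]
    return best_theta, best_err

# Toy usage with fabricated responses, labels and uniform weights:
rng = np.random.default_rng(0)
resp = rng.normal(size=10)
labs = np.where(resp > 0.2, 1, -1)
wts = np.full(10, 0.1)
print(best_threshold(resp, labs, wts))
```

Because the sweep only updates a running error, each filter's best threshold costs O(m log m) for the sort plus O(m) for the scan, which is what makes searching millions of filters per round feasible.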

Viola & Jones: Sliding window
Given a new image I, detect the faces in the image by:
for each plausible face size s
  for each possible patch centre c
    1. Extract the sub-patch of size s at c from I.
    2. Re-scale the patch to the size of the training patches.
    3. Apply the detector to the patch.
    4. Keep a record of s and c if the detector returns positive.
This is a lot of patches to be examined, and if T is very large, processing an image will be very slow!

Viola & Jones: Cascade of classifiers
Remember: we get better classification rates if we use a classifier f_T with large T. But only a tiny proportion of the patches will be faces, and many of them will not look anything like a face. Exploit this fact: given a nested set of classifier hypothesis classes, introduce a cascade of increasingly strong classifiers (computational risk minimization). [Figure: trade-off between % detection and % false positives.]
[Figure, cascaded classifier: an image sub-window is passed to Classifier 1 (1 feature); if it passes (T), it goes on to Classifier 2 (5 features); if it passes that, it goes on to Classifier 3 (20 features); a sub-window passing every stage is declared FACE, and a rejection (F) at any stage gives NON-FACE. Roughly 50%, 20% and 2% of sub-windows survive the successive stages.]
A 1-feature classifier achieves a 100% detection rate and about a 50% false positive rate. A 5-feature classifier achieves a 100% detection rate and a 40% false positive rate (20% cumulative), using data from the previous stage. A 20-feature classifier achieves a 100% detection rate with a 10% false positive rate (2% cumulative). (A sketch of the sliding-window cascade evaluation is given below.)

Viola & Jones: Typical Results
[Figure: example face detections.]
Viola and Jones, Robust object detection using a boosted cascade of simple features, CVPR 2001.
P. Viola, M. J. Jones, Robust real-time face detection. International Journal of Computer Vision 57(2): 137-154, 2004.
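The sketch below illustrates the sliding-window loop and the early-rejection logic of a cascade. The stage classifiers (simple mean/std intensity tests), thresholds, window sizes, stride and nearest-neighbour re-scaling are all stand-ins, not the actual Viola-Jones features or trained stages; only the control flow is the point.

```python
import numpy as np

def is_face(patch, stages):
    """Run a patch through the cascade; reject as soon as any stage fails."""
    for score_fn, threshold in stages:        # stages ordered cheap -> expensive
        if score_fn(patch) < threshold:
            return False                      # NON-FACE: no further work spent
    return True                               # passed every stage: FACE

def detect(image, stages, sizes=(24, 36, 48), stride=8, patch_size=24):
    """Sliding window: return (size, row, col) of every patch the cascade accepts."""
    hits = []
    H, W = image.shape
    for s in sizes:                           # each plausible face size
        for r in range(0, H - s + 1, stride): # each patch position
            for c in range(0, W - s + 1, stride):
                patch = image[r:r + s, c:c + s]
                # Crude nearest-neighbour re-scale to the training-patch size.
                idx = (np.arange(patch_size) * s) // patch_size
                if is_face(patch[np.ix_(idx, idx)], stages):
                    hits.append((s, r, c))
    return hits

# Toy usage: two fake stages on a random image, just to exercise the loop.
rng = np.random.default_rng(0)
img = rng.random((120, 160))
stages = [(lambda p: p.mean(), 0.50),         # cheap first-stage test
          (lambda p: p.std(), 0.28)]          # slightly "stronger" second stage
print(len(detect(img, stages)), "candidate detections")
```

Because most sub-windows are rejected by the first, cheapest stage, the average cost per window stays low even though the full detector is strong; that is the computational saving the cascade is designed for.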

Summary: Ensemble Prediction
We can combine many weak classifiers/regressors into a stronger classifier by voting, averaging or bagging, if the weak classifiers/regressors are better than random and if there is sufficient de-correlation (independence) amongst them.
We can combine many (high-bias) weak classifiers/regressors into a strong classifier by boosting, if the weak classifiers/regressors are chosen and combined using knowledge of how well they, and the others, performed on the task on the training data. The selection and combination encourages the weak classifiers to be complementary, diverse and de-correlated.