Background. Adaptive Filters and Machine Learning. Bootstrap. Combining models. Boosting and Bagging. Poltayev Rassulzhan

Adaptive Filters and Machine Learning: Boosting and Bagging
Background
Poltayev Rassulzhan, rasulzhan@gmail.com

Resampling: bootstrap
We use the training set and different subsets of it in order to validate results. Can a set of weak classifiers be combined to derive a strong classifier?

Combining models
Can a set of weak classifiers be combined to derive a strong classifier? Yes: we take the average of the results from different models.
Benefits:
- classification performance is better than that of a single classifier
- more resilience to noise
Drawbacks:
- the combined model becomes difficult to explain
- training is time consuming
The main idea is the wisdom of the (simulated) crowd.

Bootstrap
A bootstrap data set is created by randomly selecting n points from the training set D, with replacement. Since D itself contains n points, there is nearly always duplication of individual points in a bootstrap data set. In bootstrap estimation, the selection process is independently repeated B times to yield B bootstrap data sets, which are treated as independent sets. The bootstrap estimate of a statistic θ, denoted θ̂^(·), is

\hat{\theta}^{(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)},   (1)

where θ̂^(b) is the estimate on bootstrap sample b.
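A minimal Python sketch of Eq. (1); the statistic (the median), the toy data, and the value of B are illustrative choices, not part of the original slides:

```python
import numpy as np

def bootstrap_estimate(data, statistic, B=1000, rng=None):
    """Bootstrap estimate of a statistic: average of B resampled estimates (Eq. 1)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    estimates = []
    for _ in range(B):
        sample = data[rng.integers(0, n, size=n)]  # draw n points with replacement
        estimates.append(statistic(sample))
    estimates = np.asarray(estimates)
    return estimates.mean(), estimates

# Illustrative use: bootstrap estimate of the median of a small data set.
data = np.array([2.1, 3.5, 4.0, 5.2, 6.8, 7.1, 8.3])
theta_dot, theta_b = bootstrap_estimate(data, np.median, B=2000, rng=0)
print(theta_dot)
```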

The usual statistical problem
Task: estimate the population parameter θ using the sample estimate θ̂.
Statistical question: how wrong is the estimate?

Statistical answer
Assess the variability of θ̂: standard errors, confidence intervals, p-values for hypothesis tests about θ. The variability of the sample estimate θ̂ is assessed by taking additional samples and obtaining a new estimate of θ each time.

Bias and variance
The bootstrap estimates of the bias and variance of θ̂ are

\mathrm{bias}_{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)} - \hat{\theta} = \hat{\theta}^{(\cdot)} - \hat{\theta},   (2)

\mathrm{Var}_{\mathrm{boot}}[\theta] = \frac{1}{B} \sum_{b=1}^{B} \left[ \hat{\theta}^{(b)} - \hat{\theta}^{(\cdot)} \right]^2.   (3)
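A short sketch of Eqs. (2) and (3), computed from the same kind of resampled estimates as above (again only an illustration; the data and statistic are my own choices):

```python
import numpy as np

def bootstrap_bias_variance(data, statistic, B=1000, rng=None):
    """Bootstrap bias (Eq. 2) and variance (Eq. 3) of a sample statistic."""
    rng = np.random.default_rng(rng)
    n = len(data)
    theta_hat = statistic(data)                  # estimate on the original sample
    theta_b = np.array([statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    theta_dot = theta_b.mean()                   # bootstrap estimate, Eq. (1)
    bias = theta_dot - theta_hat                 # Eq. (2)
    var = np.mean((theta_b - theta_dot) ** 2)    # Eq. (3)
    return bias, var

data = np.array([2.1, 3.5, 4.0, 5.2, 6.8, 7.1, 8.3])
print(bootstrap_bias_variance(data, np.median, B=2000, rng=0))
```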

Outline
1 Background: introduction to bootstrap; bootstrap
2 Bagging: introduction to bagging; algorithm
3 Introduction to the problem
4 Boosting: introduction to boosting; the AdaBoost algorithm; boosting training error; boosting-like algorithms
5 Bagging and Boosting
6 References

History and terms
Bagging was introduced by Breiman (1996). Bagging stands for bootstrap aggregating; it is an ensemble method, a method of combining multiple predictors. Arcing (adaptive reweighting and combining) refers to reusing or selecting data in order to improve classification. Bagging, a name derived from bootstrap aggregation, uses multiple versions of a training set, each created by drawing n' < n samples from D with replacement. A learning algorithm is informally called unstable if small changes in the training data lead to significantly different classifiers and relatively large changes in accuracy.

Bootstrap aggregation (bagging)
Imagine we have m sets of n independent observations
S^(1) = {(X_1, Y_1), ..., (X_n, Y_n)}^(1), ..., S^(m) = {(X_1, Y_1), ..., (X_n, Y_n)}^(m),
all drawn iid from the same underlying distribution P.
- Traditional approach: fit some φ(x, S) from all the data samples.
- Aggregation: learn φ(x, S) by averaging φ(x, S^(k)) over many k.
Unfortunately we usually have a single observation set S, so we bootstrap S to form the S^(k) observation sets: choose some samples and duplicate them until a new S^(k) of the same size as S is filled. The samples not used by each set serve as validation samples (see the out-of-bag sketch below).
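As a sketch of how the S^(k) index sets and their unused (out-of-bag) validation samples can be formed from a single set S; the index handling is my own, not from the slides:

```python
import numpy as np

def bootstrap_splits(n, m, rng=None):
    """Return m bootstrap index sets of size n plus their out-of-bag indices."""
    rng = np.random.default_rng(rng)
    splits = []
    for _ in range(m):
        in_bag = rng.integers(0, n, size=n)              # S^(k): n indices drawn with replacement
        out_of_bag = np.setdiff1d(np.arange(n), in_bag)  # indices never drawn -> validation samples
        splits.append((in_bag, out_of_bag))
    return splits

for in_bag, oob in bootstrap_splits(n=10, m=3, rng=0):
    print(len(np.unique(in_bag)), "unique in-bag,", len(oob), "out-of-bag")
```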

Bagging details
Bagging is bootstrap aggregation:
- randomly generate sets of cardinality M from the original set Z, with replacement (this corrects the optimistic bias of the resubstitution (R) method)
- each bootstrap sample is used to train a different component of the base classifier
- classification is done by plurality voting

Training phase
1 Initialize the parameters: the ensemble D = ∅, and L, the number of classifiers to train.
2 For k = 1, ..., L: take a bootstrap sample S_k from Z, build a classifier D_k using S_k as the training set, and add it to the current ensemble, D = D ∪ D_k.
3 Return D.

Classification phase
4 Run D_1, ..., D_L on the input x.
5 The class with the maximum number of votes is chosen as the label for x. (A sketch of this procedure follows below.)

Example
[Figures: bagging illustrated on a toy data set.]
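A compact sketch of the training and classification phases above, using decision trees from scikit-learn as the base classifiers and plurality voting over labels in {-1, +1}; the base learner and the synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, L=25, rng=None):
    """Training phase: build L classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    ensemble = []
    for _ in range(L):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample S_k
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        ensemble.append(clf)
    return ensemble

def bagging_predict(ensemble, X):
    """Classification phase: plurality vote over the ensemble (labels in {-1, +1})."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    return np.sign(votes.sum(axis=0))

# Illustrative use on a tiny synthetic problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
ensemble = bagging_train(X, y, L=25, rng=1)
print((bagging_predict(ensemble, X) == y).mean())
```

With an odd number of voters L and labels in {-1, +1}, the vote sum never ties, so the sign is always defined.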

Example (cont.)
[Further figures from the bagging example.]

Conclusions from bagging
The error in learning is due to noise, bias and variance:
- noise is error introduced by the target function
- bias arises where the algorithm cannot represent the target
- variance comes from the sampling and how it affects the learning algorithm
Does bagging reduce these errors? Yes: averaging over bootstrap samples can reduce the error from variance, especially in the case of unstable classifiers.

Problem: betting strategy
Consider a horse-racing gambler whose goal is to maximize winnings. Consider an expert algorithm with no initial data: given information about the races, it produces a rule of thumb, for example
- bet on the horse that recently won the most races
- bet on the horse with the most favourable odds
- etc.

Problem: questions
In other words, the algorithm looks like this:
- choose a small subset of the data
- derive a rough rule of thumb
- take a second subset of the data
- derive a second rule of thumb
- repeat T times
Questions: How do we choose the collections of races presented to the expert for extracting rules of thumb? How do we combine all the rules into a single, accurate prediction?
Answers: concentrate on the hardest examples, those most often misclassified by the previous rules of thumb, and take a weighted majority of the rules of thumb.

Introduction to boosting
Boosting is a general method for improving the accuracy of any given learning algorithm. Assume we are given a weak learning algorithm that can consistently find classifiers ("rules of thumb") at least slightly better than random guessing, say 51% accuracy; this is the weak learning assumption. Given sufficient data, a boosting algorithm can then provably construct a single classifier with very high accuracy, say 98-99%.

PAC learning model
Boosting has roots in the theoretical (PAC) machine learning model: we get random examples from an unknown, arbitrary distribution.
- A strong PAC learning algorithm, for any distribution, with high probability and given polynomially many examples, can find a classifier with arbitrarily small generalization error.
- A weak PAC learning algorithm is the same, except the generalization error only needs to be slightly better than random guessing (1/2 - γ).
Kearns and Valiant asked: does weak learnability imply strong learnability?

What is AdaBoost?
We begin by describing the most popular boosting algorithm, due to Freund and Schapire (1997), called AdaBoost.M1. AdaBoost (adaptive boosting) allows the designer to continue adding weak learners until some desired low training error has been achieved, and it focuses on the informative or difficult patterns. AdaBoost constructs a strong classifier as a linear combination

f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)   (5)

of simple weak classifiers h_t(x). Here h_t is a weak or basis classifier (hypothesis, feature), and H(x) = sign(f(x)) is the strong or final classifier/hypothesis.

AdaBoost
Recall the description of boosting. We have a training set (x_1, y_1), ..., (x_m, y_m), where y_i ∈ {-1, +1} is the correct label of instance x_i ∈ X. For t = 1, ..., T:
- construct a distribution D_t on {1, ..., m}
- find a weak classifier ("rule of thumb")

  h_t : X \to \{-1, +1\}   (6)

  with small error ε_t on D_t:

  \epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]   (7)

Then output the final classifier H_final (a runnable sketch follows after this slide group).

Constructing D_t
Initialize the weights D_1(i) = 1/m. Given D_t and h_t,

D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases}

where Z_t is a normalization factor and

\alpha_t = \frac{1}{2} \ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right) > 0.   (9)

The final classifier is

H_{\mathrm{final}}(x) = \mathrm{sign}\!\left(\sum_t \alpha_t h_t(x)\right).   (8)

Toy example (Robert Schapire): Round 1
In this example the weak classifiers are vertical half-planes and horizontal half-planes. [Figure: first round of AdaBoost on the toy data set.]
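A minimal sketch of the algorithm above with decision stumps (a threshold on a single feature) as the weak classifiers h_t; the stump search and the toy data are my own illustrative choices, not part of the slides:

```python
import numpy as np

def train_stump(X, y, D):
    """Find the axis-aligned threshold stump with the smallest weighted error under D."""
    best = (np.inf, 0, 0.0, 1)                 # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > thr, 1, -1)
                err = D[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, s)
    return best

def adaboost(X, y, T=20):
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        eps, j, thr, s = train_stump(X, y, D)
        eps = max(eps, 1e-12)                  # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # Eq. (9)
        pred = s * np.where(X[:, j] > thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)      # down-weight correct, up-weight misclassified points
        D /= D.sum()                           # normalize by Z_t
        stumps.append((j, thr, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    """Final classifier H(x) = sign(sum_t alpha_t h_t(x)), Eq. (8)."""
    f = sum(a * s * np.where(X[:, j] > thr, 1, -1) for (j, thr, s), a in zip(stumps, alphas))
    return np.sign(f)

# Illustrative use: a pattern no single stump can fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost(X, y, T=40)
print("training accuracy:", (predict(stumps, alphas, X) == y).mean())
```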

Round 2, Round 3, Final classifier
[Figures: the second and third rounds of AdaBoost on the toy data set, and the final classifier obtained by weighting and combining the three half-plane classifiers.]

One more example
[Figures from Jiri Matas and Jan Sochman.]

Practice
Test the algorithm interactively at http://cseweb.ucsd.edu/~yfreund/adaboost/
[Screenshots of the AdaBoost applet.]


Training error: theorem
Write ε_t as 1/2 - γ_t (γ_t is the "edge"). Then

\mathrm{training\ error}(H_{\mathrm{final}}) \le \prod_t \left[ 2\sqrt{\epsilon_t(1-\epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\!\left(-2\sum_t \gamma_t^2\right),

so if γ_t ≥ γ > 0 for all t, then training error(H_final) ≤ e^{-2γ²T}.
Note that AdaBoost is adaptive: it does not need to know γ or T a priori, and it can exploit γ_t ≥ γ.

Training error: proof, step 1
We prove the theorem in three steps.
Step 1:

D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t},   (14)

where

f(x) = \sum_t \alpha_t h_t(x).   (15)

Proof: unwrapping the recurrence, we get

D_{T+1}(i) = D_1(i) \cdot \frac{\exp(-\alpha_1 y_i h_1(x_i))}{Z_1} \cdots \frac{\exp(-\alpha_T y_i h_T(x_i))}{Z_T} = \frac{1}{m} \cdot \frac{\exp\!\left(-y_i \sum_t \alpha_t h_t(x_i)\right)}{\prod_t Z_t}.   (16), (17)

Training error: proof, step 2
The training error of the final classifier H is at most ∏_{t=1}^{T} Z_t.   (18)
Proof:

\mathrm{training\ error}(H) = \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \neq H(x_i) \\ 0 & \text{else} \end{cases}
= \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i f(x_i) \le 0 \\ 0 & \text{else} \end{cases}   (by definition H(x) = sign(f(x)) and y_i ∈ {-1, +1})
\le \frac{1}{m} \sum_i \exp(-y_i f(x_i))   (since e^{-z} ≥ 1 if z ≤ 0)
= \sum_i D_{T+1}(i) \prod_t Z_t   (by Step 1)
= \prod_t Z_t.

Training error: proof, step 3
The last step is to compute Z_t. The normalization constants are

Z_t = \sum_i D_t(i) \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases}
= e^{-\alpha_t} \sum_{i: h_t(x_i) = y_i} D_t(i) + e^{\alpha_t} \sum_{i: h_t(x_i) \neq y_i} D_t(i)
= e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\epsilon_t   (by definition of ε_t)
= 2\sqrt{(1-\epsilon_t)\epsilon_t}   (by our choice of α_t)
= \sqrt{1 - 4\gamma_t^2}   (plugging in ε_t = 1/2 - γ_t)
\le e^{-2\gamma_t^2}   (using 1 + x ≤ e^x for all real x).

Combining with Step 2 gives the claimed upper bound on the training error of H, which proves the theorem.

Training error: result

\frac{1}{m} \sum_i [y_i \neq H(x_i)] \le \prod_{t=1}^{T} Z_t \le \exp\!\left(-2\sum_t \gamma_t^2\right),

so AdaBoost achieves zero training error exponentially fast, as seen for example in digit recognition. Boosting is also robust to overfitting: the test error often keeps decreasing even after the training error is zero. (A numeric illustration of the bound follows after this slide group.)

Generalization error

\mathrm{error}_{\mathrm{true}}(H) \le \mathrm{error}_{\mathrm{train}}(H) + \tilde{O}\!\left(\sqrt{\frac{Td}{m}}\right),   (19)

where T is the number of boosting rounds, d is the VC dimension of the weak learner (measuring the complexity of the classifier), and m is the number of training examples.
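As a quick numeric illustration of the theorem (the edge value γ is an arbitrary choice of mine), the product of the per-round factors can be compared with the exponential closed-form bound:

```python
import numpy as np

gamma = 0.1                                   # assume every weak classifier has edge gamma_t = 0.1
T = np.arange(1, 201)
per_round = np.sqrt(1 - 4 * gamma**2)         # each factor sqrt(1 - 4 gamma_t^2)
bound_product = per_round ** T                # prod_t sqrt(1 - 4 gamma_t^2)
bound_exp = np.exp(-2 * gamma**2 * T)         # e^{-2 gamma^2 T}, the looser closed form

for t in (10, 50, 100, 200):
    print(t, bound_product[t - 1], bound_exp[t - 1])
# The product stays below exp(-2 gamma^2 T), and both decay exponentially in T.
```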

Margin
We can define the margin of a training example as

\gamma(x_i) = y_i \, \frac{\alpha_1 h_1(x_i) + \ldots + \alpha_M h_M(x_i)}{\alpha_1 + \ldots + \alpha_M},   (20)

where γ(x_i) ∈ [-1, +1] and is positive if H(x_i) = y_i. Iterations of AdaBoost increase the margin of the training examples, which is why the test error continues to decrease. The margin of an object is related to the certainty of its classification:
- a large positive margin means a confident, correct classification
- a negative margin means an incorrect classification
- a very small margin means uncertainty in the classification
(A sketch of this computation follows below.)

Application: Viola-Jones
The Viola-Jones detector uses Haar-like wavelets, giving millions of possible classifiers. Let I(x) be the pixel of image I at position x. For two rectangles of pixels A and B, the feature is

f(x) = \sum_{x \in A} I(x) - \sum_{x \in B} I(x),   (21)

and the corresponding weak classifier is

\varphi(x) = \begin{cases} 1 & \text{if } f(x) > 0 \\ -1 & \text{otherwise.} \end{cases}   (22)

[Figures: Haar-like features φ_1(x), φ_2(x) and Viola-Jones detection results.]
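A small sketch of Eq. (20), computing the normalized margin of each training example from the stumps and weights produced by the AdaBoost sketch earlier; the data structures follow that earlier sketch and are therefore assumptions, not part of the slides:

```python
import numpy as np

def margins(stumps, alphas, X, y):
    """Normalized margin gamma(x_i) = y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t (Eq. 20)."""
    f = sum(a * s * np.where(X[:, j] > thr, 1, -1) for (j, thr, s), a in zip(stumps, alphas))
    return y * f / np.sum(alphas)

# gamma lies in [-1, +1]: positive means a correct classification, large positive values mean
# confident predictions, negative values mean errors.
# Example, continuing the AdaBoost sketch: print(np.min(margins(stumps, alphas, X, y)))
```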

Review: SVM weak learners
Weak learners don't need to be weak! For an SVM with kernel K, the dual problem is

\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j),   \text{where } 0 \le \alpha_i \le C,   (23)

and classification of x is ŷ = sign(ŵ_0 + Σ_{i: α_i > 0} α_i y_i K(x_i, x)). A positive-definite kernel corresponds to a dot product in feature space. [Figure: 20 boosted SVMs, each with 5 support vectors and the RBF kernel.]

Similarity
So boosting shares its idea with classifier combination: given hypothesis functions h_1(x), ..., h_M(x),

H(x) = \alpha_1 h_1(x) + \ldots + \alpha_M h_M(x),   (24)

where α_i is the vote assigned to classifier h_i, and the prediction is

\hat{y}(x) = \mathrm{sign}(H(x)).   (25)

A classifier h_i can be simple, e.g. based on a single feature (a sketch of this weighted vote follows below).

Bagging and Boosting
- Bagging: a linear combination of multiple learners; very robust to noise; a lot of redundant effort.
- Boosting: a weighted combination of arbitrary learners; builds a very strong learner from very simple ones; sensitive to noise.
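A short sketch of Eqs. (24)-(25): the weighted vote over any fixed set of classifiers, whether stumps, SVMs, or other learners; the callables and weights below are purely illustrative:

```python
import numpy as np

def weighted_vote(classifiers, weights, X):
    """H(x) = sum_i alpha_i h_i(x); prediction = sign(H(x))  (Eqs. 24-25)."""
    H = sum(a * np.asarray(h(X)) for h, a in zip(classifiers, weights))
    return np.sign(H)

# Illustrative use with three single-feature "classifiers" on 2-D inputs.
h1 = lambda X: np.where(X[:, 0] > 0, 1, -1)
h2 = lambda X: np.where(X[:, 1] > 0, 1, -1)
h3 = lambda X: np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
X = np.array([[1.0, -2.0], [-0.5, 0.2], [2.0, 1.0]])
print(weighted_vote([h1, h2, h3], [0.5, 0.3, 0.8], X))
```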

Bagging and Boosting (cont.)
- Bagging: each model is trained independently.
- Boosting: each model is built on top of the previous ones.

References
- Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification.
- Jiri Matas and Jan Sochman. AdaBoost (lecture slides).
- Christopher M. Bishop. Pattern Recognition and Machine Learning (combining models, boosting).
- Robert E. Schapire. Boosting (tutorial).
- Yoav Freund and Robert E. Schapire. A Short Introduction to Boosting.