Combining Classifiers

Combining Classifiers. Generic methods of generating and combining multiple classifiers: Bagging; Boosting; Stacking ("meta-learn" which classifier does well where); Error-correcting codes (going from binary to multi-class problems). References: Duda, Hart & Stork, pp. 475-480; Hastie, Tibshirani, Friedman, pp. 246-256 and Chapter 10; http://www.boosting.org/ Bulletin Board ("Is there a book available on boosting?").

Why Combine Classifiers? Combine several classifiers to produce a more accurate single classifier. If C_2 and C_3 are correct where C_1 is wrong, etc., a majority vote will do better than each C_i individually. Suppose each C_i has error rate p < 0.5 and the errors of different C_i are uncorrelated. Then Pr(r out of n classifiers are wrong) = (n choose r) p^r (1-p)^(n-r), and Pr(majority of n classifiers are wrong) = the right half of this binomial distribution, which is small if n is large and p is small. [Figure: probability of error vs. r = number of classifiers, r = 1, 2, ..., n.]
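
A quick numerical check of this argument (a minimal Python sketch; `majority_error` is an illustrative helper, not something from the lecture):

```python
# Probability that a strict majority of n independent classifiers, each with
# error rate p, are simultaneously wrong: the "right half" of Binomial(n, p).
from math import comb

def majority_error(n, p):
    return sum(comb(n, r) * p**r * (1 - p)**(n - r)
               for r in range(n // 2 + 1, n + 1))

print(majority_error(1, 0.3))   # 0.3    -- a single classifier
print(majority_error(21, 0.3))  # ~0.026 -- 21 uncorrelated classifiers, p = 0.3
```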

Bagging "Bootstrap aggregation" Bootstrap estiation - generate data set by randoly selecting fro training set with replaceent (soe points ay repeat) repeat B ties use as estiate the average of individual estiates Bagging generate B equal size training sets each training set is drawn randoly, with replaceent, fro the data is used to generate a different coponent classifier f i usually using sae algorith (e.g. decision tree) final classifier decides by voting aong coponent classifiers Leo Breian, 1996.

Bagging (contd). Suppose there are k classes. Each f_i(x) predicts one of the classes; equivalently, f_i(x) = (0, 0, ..., 0, 1, 0, ..., 0). Define f_bag(x) = (1/B) Σ_{i=1}^B f_i(x) = (p_1(x), ..., p_k(x)), where p_j(x) is the proportion of the f_i predicting class j at x. The bagged prediction is arg max_j p_j(x). Bagging always reduces variance (provably) for squared error, but not always for classification (0/1 loss). In practice it is usually most effective if the classifiers are "unstable", i.e. depend sensitively on the training points. However, it may lose interpretability: a bagged decision tree is not a single decision tree.
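
A minimal bagging sketch in Python, assuming integer class labels 0..k-1 and scikit-learn decision trees as the (unstable) component learner; the helper names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, B=50, seed=0):
    # Train B trees, each on a bootstrap sample drawn with replacement.
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X, k):
    # f_bag(x): average the one-hot votes, then take arg max over the k classes.
    votes = np.zeros((len(X), k))
    for f in models:
        votes[np.arange(len(X)), f.predict(X)] += 1
    return votes.argmax(axis=1)
```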

Boosting. Generate the component classifiers so that each does well where the previous ones do badly. Train classifier C_1 using (some part of) the training data. Train classifier C_2 so that it performs well on points where C_1 performs badly. Train classifier C_3 to perform well on data classified badly by C_1 and C_2, etc. The overall classifier C classifies by weighted voting among the component classifiers C_i. The same algorithm is used to generate each C_i; only the data used for training changes.

AdaBoost "Adaptive Boosting" Give each training point (x i, y i =!1) ) in D a weight w i (initialized uniforly) Repeat: Draw a training set D at rando fro D according to the weights w i Generate classifier C using training set D Measure error of C on D Increase weights of isclassified training points Decrease weights of correctly classified points Overall classification is deterined by C boost (x) = Sign( C (x)), where easures the "quality" of C Terinate when C boost (x) has low error

AdaBoost (Details). Initialize weights uniformly: w_i^(1) = 1/N (N = training set size). Repeat for m = 1, 2, ..., M:
- Draw a random training set D' from D according to the weights w_i^(m).
- Train classifier C_m using training set D'.
- Compute err_m = Pr_{i~D}[C_m(x_i) ≠ y_i], the error rate of C_m on the (weighted) training points.
- Compute α_m = 0.5 log((1-err_m)/err_m); α_m = 0 when err_m = 0.5, and α_m → ∞ as err_m → 0.
- Set w_i* = w_i^(m) exp(α_m) = w_i^(m) √((1-err_m)/err_m) if x_i is incorrectly classified, and w_i* = w_i^(m) exp(-α_m) = w_i^(m) √(err_m/(1-err_m)) if x_i is correctly classified.
- Set w_i^(m+1) = w_i*/Z_m, where Z_m = Σ_i w_i* is a normalization factor so that Σ_i w_i^(m+1) = 1.
Overall classification is determined by C_boost(x) = Sign(Σ_m α_m C_m(x)).
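
The loop above, written out as a short Python sketch. This is the weighted-error variant (the weak learner is trained directly with the weights rather than on a resampled D'), and `weak_learner(X, y, w)` is an assumed interface returning a fitted classifier with a `.predict` method; labels are assumed to be ±1:

```python
import numpy as np

def adaboost(X, y, weak_learner, M=50):
    N = len(X)
    w = np.full(N, 1.0 / N)                      # w_i^(1) = 1/N
    classifiers, alphas = [], []
    for m in range(M):
        C = weak_learner(X, y, w)                # train C_m on the weighted data
        pred = C.predict(X)
        err = np.sum(w * (pred != y))            # weighted error err_m
        if err >= 0.5 or err == 0.0:             # no usable weak learner / perfect fit
            break
        alpha = 0.5 * np.log((1 - err) / err)    # alpha_m
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes, down-weight correct points
        w /= w.sum()                             # normalize by Z_m
        classifiers.append(C)
        alphas.append(alpha)
    return classifiers, alphas

def boost_predict(classifiers, alphas, X):
    # C_boost(x) = Sign(sum_m alpha_m C_m(x))
    return np.sign(sum(a * C.predict(X) for a, C in zip(alphas, classifiers)))
```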

Theory. If each component classifier C_m is a "weak learner" that performs better than random chance (err_m < 0.5), then the TRAINING SET ERROR of C_boost can be made arbitrarily small as M (the number of boosting rounds) → ∞. Proof: see later. Probabilistic bounds on the TEST SET ERROR can be obtained as a function of the training set error, sample size, number of boosting rounds, and "complexity" of the classifiers C_m. If the Bayes risk is high, it may become impossible to continually find C_m which perform better than chance. "In theory, theory and practice are the same, but in practice they are different."

Practice. Use an independent test set to determine the stopping point. Boosting performs very well in practice and is fast; boosting decision "stumps" is competitive with decision trees. The test set error may continue to fall even after the training set error reaches 0, and boosting does not (usually) overfit. It is sometimes vulnerable to outliers/noise, and the result may be difficult to interpret. "AdaBoost with trees is the best off-the-shelf classifier in the world" - Breiman, 1996. [Figure: test set error and training set error vs. boosting rounds.]
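
For the decision "stumps" mentioned here, a brute-force weighted stump that plugs into the AdaBoost sketch above (illustrative; the class name and interface are assumptions):

```python
import numpy as np

class Stump:
    # Threshold classifier on a single feature, chosen to minimize weighted error.
    def fit(self, X, y, w):
        best_err = np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] > t, 1, -1)
                    err = np.sum(w * (pred != y))
                    if err < best_err:
                        best_err, self.j, self.t, self.s = err, j, t, s
        return self

    def predict(self, X):
        return self.s * np.where(X[:, self.j] > self.t, 1, -1)

# usage with the sketch above:
# classifiers, alphas = adaboost(X, y, lambda X, y, w: Stump().fit(X, y, w))
```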

History. Robert Schapire, 1989: a weak classifier could be boosted. Yoav Freund, 1995: boost by combining many weak classifiers; required a bound on the error rate of the weak classifier. Freund & Schapire, 1996: AdaBoost adapts the weights based on the error rate of the weak classifier. Many extensions since then: boosting decision trees, Naive Bayes, ...; more robustness to noise; improving interpretability of the boosted classifier; incorporating prior knowledge; extending to the multi-class case; "Balancing between Boosting and Bagging using Bumping"; ...

Proof. Claim: if err_m < 0.5 for all m, then the training set error of C_boost → 0 as M → ∞.
Note: y_i C_m(x_i) = 1 if x_i is correctly classified by C_m and -1 if x_i is incorrectly classified by C_m; similarly for C_boost(x) = sign(Σ_m α_m C_m(x)).
The training set error of the classifier C_boost(x) is err_boost = |{i : C_boost(x_i) ≠ y_i}| / N.
C_boost(x_i) ≠ y_i if and only if y_i Σ_m α_m C_m(x_i) < 0, if and only if -y_i Σ_m α_m C_m(x_i) > 0. Hence C_boost(x_i) ≠ y_i implies exp(-y_i Σ_m α_m C_m(x_i)) > 1, so err_boost ≤ [Σ_i exp(-y_i Σ_m α_m C_m(x_i))]/N.
By definition, w_i^(m+1) = w_i^(m) exp(-α_m y_i C_m(x_i))/Z_m, so exp(-α_m y_i C_m(x_i)) = Z_m w_i^(m+1)/w_i^(m).
Now insert the sum into the exponential: exp(-y_i Σ_m α_m C_m(x_i)) = Π_m exp(-α_m y_i C_m(x_i)) = Π_m Z_m w_i^(m+1)/w_i^(m) = (w_i^(M+1)/w_i^(1)) Π_m Z_m = N w_i^(M+1) Π_m Z_m.

Proof (continued). Thus [Σ_i exp(-y_i Σ_m α_m C_m(x_i))]/N = Σ_i w_i^(M+1) Π_m Z_m = Π_m Z_m, because Σ_i w_i^(M+1) = 1 (having been normalized by Z_M).
Nothing has been said so far about the choice of α_m. Set α_m = 0.5 log((1-err_m)/err_m). Then w_i* = w_i^(m) √((1-err_m)/err_m) if x_i is incorrectly classified and w_i* = w_i^(m) √(err_m/(1-err_m)) if x_i is correctly classified.
To normalize, set Z_m = Σ_i w_i* = Σ_i w_i^(m) [err_m √((1-err_m)/err_m) + (1-err_m) √(err_m/(1-err_m))] = Σ_i w_i^(m) [√(err_m(1-err_m)) + √(err_m(1-err_m))] = 2√(err_m(1-err_m)), because Σ_i w_i^(m) = 1.
So err_boost ≤ [Σ_i exp(-y_i Σ_m α_m C_m(x_i))]/N = Π_m Z_m = Π_m 2√(err_m(1-err_m)). NOTE: D, H & S, pg. 479, says err_boost = Π_m 2√(err_m(1-err_m)).

Proof (continued). Let γ_m = 0.5 - err_m > 0 for all m; γ_m is the "edge" of C_m over random guessing. Then 2√(err_m(1-err_m)) = 2√((0.5-γ_m)(0.5+γ_m)) = √(1-4γ_m²).
So err_boost ≤ Π_m √(1-4γ_m²) ≤ Π_m (1-2γ_m²), since (1-x)^0.5 = 1 - 0.5x - ..., and Π_m (1-2γ_m²) ≤ Π_m exp(-2γ_m²) = exp(-2 Σ_m γ_m²), since 1+x ≤ exp(x).
If γ_m ≥ γ > 0 for all m, then err_boost ≤ exp(-2 Σ_m γ²) = exp(-2Mγ²), which tends to zero exponentially fast as M → ∞.
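
The whole chain of the proof, collected into one display (a LaTeX restatement of the bound derived on these slides):

```latex
\[
\mathrm{err}_{\mathrm{boost}}
  \le \frac{1}{N}\sum_{i=1}^{N}\exp\!\Big(-y_i\sum_{m=1}^{M}\alpha_m C_m(x_i)\Big)
  = \prod_{m=1}^{M} Z_m
  = \prod_{m=1}^{M} 2\sqrt{\mathrm{err}_m(1-\mathrm{err}_m)}
  = \prod_{m=1}^{M} \sqrt{1-4\gamma_m^2}
  \le \exp\!\Big(-2\sum_{m=1}^{M}\gamma_m^2\Big),
\]
where $\gamma_m = \tfrac{1}{2} - \mathrm{err}_m$ is the edge of $C_m$; if $\gamma_m \ge \gamma > 0$
for all $m$, the right-hand side is at most $e^{-2M\gamma^2} \to 0$.
```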

Why Boosting Works. "The success of boosting is really not very mysterious." - Jerome Friedman, 2000. Additive models: f(x) = Σ_m β_m b(x; θ_m), classified using Sign(f(x)); b is a "basis" function parametrized by θ_m, and the β_m are weights. Examples: neural networks (b = activation function, θ = input-to-hidden weights); support vector machines (b = kernel function, appropriately parametrized); boosting (b = weak classifier, appropriately parametrized).

Fitting Additive Models. To fit f(x) = Σ_m β_m b(x; θ_m), the β_m, θ_m are usually found by minimizing a loss function (e.g. squared error) over the training set. Forward stagewise fitting adds new basis functions to the expansion one by one, and does not modify the previous terms. Algorithm: f_0(x) = 0; for m = 1 to M: find β_m, θ_m by min_{β,θ} Σ_i L(y_i, f_{m-1}(x_i) + β b(x_i; θ)); set f_m(x) = f_{m-1}(x) + β_m b(x; θ_m). AdaBoost is forward stagewise fitting applied to the weak classifier with an EXPONENTIAL loss function.
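
A concrete forward-stagewise sketch in Python (illustrative, not from the lecture): the basis functions are regression stumps on a 1-D input and the loss is squared error, so each round simply fits one new stump to the current residual and leaves the earlier terms untouched:

```python
import numpy as np

def fit_stump_ls(x, r):
    # Least-squares stump b(x): a threshold t with one constant level on each side,
    # chosen to minimize sum_i (r_i - b(x_i))^2 on the residuals r.
    # (Assumes x has at least two distinct values.)
    best = None
    for t in np.unique(x)[:-1]:
        lo, hi = r[x <= t].mean(), r[x > t].mean()
        sse = np.sum((r - np.where(x <= t, lo, hi)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, lo, hi)
    _, t, lo, hi = best
    return lambda z: np.where(z <= t, lo, hi)

def forward_stagewise(x, y, M=20):
    f = np.zeros_like(y, dtype=float)      # f_0(x) = 0
    terms = []
    for _ in range(M):
        b = fit_stump_ls(x, y - f)         # fit a new basis function to the residual
        f = f + b(x)                       # f_m = f_{m-1} + b_m; previous terms unchanged
        terms.append(b)
    return terms, f
```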

AdaBoost (Derivation). L(y, f(x)) = exp(-y f(x)), the exponential loss.
(β_m, C_m) = argmin_{β,C} Σ_i exp(-y_i (f_{m-1}(x_i) + β C(x_i))) = argmin_{β,C} Σ_i exp(-y_i f_{m-1}(x_i)) exp(-β y_i C(x_i)) = argmin_{β,C} Σ_i w_i^(m) exp(-β y_i C(x_i)), where w_i^(m) = exp(-y_i f_{m-1}(x_i)); w_i^(m) depends on neither β nor C.
Note: Σ_i w_i^(m) exp(-β y_i C(x_i)) = e^{-β} Σ_{y_i = C(x_i)} w_i^(m) + e^{β} Σ_{y_i ≠ C(x_i)} w_i^(m) = e^{-β} Σ_i w_i^(m) + (e^{β} - e^{-β}) Σ_i w_i^(m) Ind(y_i ≠ C(x_i)).
For β > 0, pick C_m = argmin_C Σ_i w_i^(m) Ind(y_i ≠ C(x_i)) = argmin_C err_m.

AdaBoost (Derivation, continued). Substituting C_m back yields e^{-β} Σ_i w_i^(m) + (e^{β} - e^{-β}) err_m, a function of β only. The minimizer β_m = argmin_β [e^{-β} Σ_i w_i^(m) + (e^{β} - e^{-β}) err_m] can be found by differentiating, etc. (exercise; worked below), giving β_m = 0.5 log((1-err_m)/err_m).
The model update is f_m(x) = f_{m-1}(x) + β_m C_m(x), so w_i^(m+1) = exp(-y_i f_m(x_i)) = exp(-y_i (f_{m-1}(x_i) + β_m C_m(x_i))) = exp(-y_i f_{m-1}(x_i)) exp(-β_m y_i C_m(x_i)) = w_i^(m) exp(-β_m y_i C_m(x_i)), deriving the weight update rule.
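
Working the "exercise": assume the weights have been normalized so that Σ_i w_i^(m) = 1, so the objective e^{-β} + (e^{β} - e^{-β}) err_m can be rewritten as e^{-β}(1-err_m) + e^{β} err_m, and set its derivative to zero:

```latex
\[
\frac{d}{d\beta}\Big[e^{-\beta}\,(1-\mathrm{err}_m) + e^{\beta}\,\mathrm{err}_m\Big]
  = -e^{-\beta}(1-\mathrm{err}_m) + e^{\beta}\,\mathrm{err}_m = 0
  \;\Longrightarrow\;
  e^{2\beta} = \frac{1-\mathrm{err}_m}{\mathrm{err}_m}
  \;\Longrightarrow\;
  \beta_m = \tfrac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}.
\]
```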

Exponential Loss. L_1(y, f(x)) = exp(-y f(x)), the exponential loss; L_2(y, f(x)) = Ind(y f(x) < 0), the 0/1 loss; L_3(y, f(x)) = (y - f(x))^2, the squared error. [Figure: L_1, L_2, L_3 plotted against y f(x), the unnormalized margin.] Exponential loss puts heavy weight on examples with a large negative margin. These are difficult, atypical training points; boosting is sensitive to outliers.
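
The three losses as functions of the unnormalized margin z = y f(x), for comparison (a small NumPy sketch; the function names are ad hoc):

```python
import numpy as np

def exp_loss(z):  return np.exp(-z)                 # L_1: exponential loss
def zero_one(z):  return (z < 0).astype(float)      # L_2: 0/1 loss
def squared(z):   return (1.0 - z) ** 2             # L_3: (y - f(x))^2 with y in {-1,+1}

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, L in [("exp", exp_loss), ("0/1", zero_one), ("sq", squared)]:
    print(name, L(z))   # the exponential loss dominates for large negative margins
```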

Boosting and SVMs. The margin of (x_i, y_i) is (y_i Σ_m α_m C_m(x_i)) / Σ_m α_m = y_i (α · C(x_i)) / ‖α‖_1, where C(x_i) is the vector of weak-classifier outputs; it lies between -1 and 1 and is > 0 if and only if x_i is classified correctly. Large margins on the training set yield better bounds on the generalization error. It can be argued that boosting attempts to (approximately) maximize the minimum margin, max_α min_i y_i (α · C(x_i)) / ‖α‖_1, the same expression as for the SVM, but with a 1-norm instead of a 2-norm.
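
A side-by-side reading of the two objectives (reconstructed in LaTeX; φ denotes the SVM feature map, which the slide leaves implicit):

```latex
% Margin maximization: boosting (1-norm) vs. the SVM (2-norm)
\[
\text{Boosting:}\quad \max_{\alpha}\ \min_{i}\ \frac{y_i\,(\alpha \cdot C(x_i))}{\|\alpha\|_1}
\qquad\qquad
\text{SVM:}\quad \max_{w}\ \min_{i}\ \frac{y_i\,(w \cdot \phi(x_i))}{\|w\|_2}
\]
```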

Stacking. Stacking = "stacked generalization". It is usually used to combine models l_1, ..., l_r of different types, e.g. l_1 = neural network, l_2 = decision tree, l_3 = Naive Bayes, .... Use a "meta-learner" L to learn which classifier is best where. Let x be an instance for the component learners. A training instance for L is of the form (l_1(x), ..., l_r(x)), where l_i(x) = class predicted by classifier l_i, OR (l_11(x), ..., l_1k(x), ..., l_r1(x), ..., l_rk(x)), where l_ij(x) = probability that x is in class j according to classifier l_i.

Stacking (continued). What should the class label for L be? Using the actual label from the training data may prefer classifiers that overfit. Using a "hold-out" data set which is not used to train l_1, ..., l_r wastes data. Using cross-validation (when x occurs in the held-out fold, use it as a training instance for L) is computationally expensive. Use simple linear models for L. David Wolpert, 1992.
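
A minimal stacking sketch (illustrative, assuming scikit-learn estimators and numeric class labels): out-of-fold predictions from the base learners form the training instances for a simple linear meta-learner L, which addresses the overfitting concern above without a separate hold-out set:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

def fit_stack(X, y, base_learners, meta=None):
    meta = LogisticRegression() if meta is None else meta
    # Column i of Z holds l_i(x) computed by cross-validation, so L never sees a
    # prediction made by a model trained on that same point.
    Z = np.column_stack([cross_val_predict(l, X, y, cv=5) for l in base_learners])
    fitted = [l.fit(X, y) for l in base_learners]       # refit on all the data
    return fitted, meta.fit(Z, y)

def stack_predict(fitted, meta, X):
    Z = np.column_stack([l.predict(X) for l in fitted])
    return meta.predict(Z)

# e.g. base_learners = [DecisionTreeClassifier(), GaussianNB()]
```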

Error-correcting Codes. Using binary classifiers to predict a multi-class problem: generate one binary classifier C_i for each class vs. every other class (LHS), or one per column of a longer codeword (RHS):

  class  C1 C2 C3 C4      class  C1 C2 C3 C4 C5 C6 C7
  a       1  0  0  0      a       1  1  1  1  1  1  1
  b       0  1  0  0      b       0  0  0  0  1  1  1
  c       0  0  1  0      c       0  0  1  1  0  0  1
  d       0  0  0  1      d       0  1  0  1  0  1  0

Each binary classifier C_i predicts the i-th bit. LHS: predictions like "1 0 1 0" cannot be "decoded". RHS: predictions like "1 0 1 1 1 1 1" are class "a" (C_2 made a mistake).
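
A small decoding sketch using the RHS code matrix from the table (the `decode` helper is illustrative): each prediction vector is mapped to the class whose codeword is nearest in Hamming distance:

```python
import numpy as np

# Rows: codewords for classes a, b, c, d; columns: the bits predicted by C_1..C_7.
codes = np.array([[1, 1, 1, 1, 1, 1, 1],   # a
                  [0, 0, 0, 0, 1, 1, 1],   # b
                  [0, 0, 1, 1, 0, 0, 1],   # c
                  [0, 1, 0, 1, 0, 1, 0]])  # d
classes = ["a", "b", "c", "d"]

def decode(bits):
    # Hamming distance from the predicted bit string to every codeword.
    dists = (codes != np.asarray(bits)).sum(axis=1)
    return classes[int(np.argmin(dists))]

print(decode([1, 0, 1, 1, 1, 1, 1]))   # "a" -- one flipped bit is still decodable
```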

Hamming Distance. The Hamming distance H between codewords is the number of single-bit corrections needed to convert one into the other: H(1000, 0100) = 2; H(1111111, 0000111) = 4. Up to (d-1)/2 single-bit errors (rounded down) can be corrected if d = minimum Hamming distance between any pair of codewords. LHS: d = 2, no error correction. RHS: d = 4, corrects all single-bit errors. Tom Dietterich and Ghulum Bakiri, 1995.
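
And the distance itself, matching the two examples above (a tiny illustrative helper):

```python
def hamming(u, v):
    # Number of positions at which two equal-length codewords differ.
    return sum(a != b for a, b in zip(u, v))

print(hamming("1000", "0100"))        # 2
print(hamming("1111111", "0000111"))  # 4
```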