Learning theory. Ensemble methods. Boosting.

Learning theory

Probability distribution P over X × {−1, +1}; let (X, Y) ∼ P. We get S := {(x_i, y_i)}_{i=1}^n, an iid sample from P.

Goal: Fix ε, δ ∈ (0, 1). With probability at least 1 − δ (over the random choice of S), learn a classifier f̂ : X → {−1, +1} with low error rate err(f̂) = P(f̂(X) ≠ Y) ≤ ε.

Basic question: When is this possible? Suppose I even promise you that there is a perfect classifier from a particular function class F (e.g., F = linear classifiers or F = decision trees).

Default: Empirical Risk Minimization (i.e., pick the classifier from F with the lowest training error rate), but this might be computationally difficult (e.g., for decision trees).

Another question: Is it easier to learn just non-trivial classifiers in F (i.e., ones better than random guessing)?

Boosting

Boosting: using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor.

Motivation: It is easy to construct classification rules that are correct more often than not (e.g., "If enough of the e-mail characters are dollar signs, then it's spam."), but it seems hard to find a single rule that is almost always correct.

Basic idea (a rough code sketch appears after the history notes below):
Input: training data S.
For t = 1, 2, ..., T:
  1. Choose a subset of examples S_t ⊆ S (or a distribution over S).
  2. Use the weak learning algorithm to get a classifier: f_t := WL(S_t).
Return an ensemble classifier based on f_1, f_2, ..., f_T.

Boosting: history

1984: Valiant and Kearns ask whether boosting is theoretically possible (formalized in the PAC learning model).
1989: Schapire creates the first boosting algorithm, solving the open problem of Valiant and Kearns.
1990: Freund creates an optimal boosting algorithm (Boost-by-majority).
1992: Drucker, Schapire, and Simard empirically observe practical limitations of early boosting algorithms.
1995: Freund and Schapire create AdaBoost, a boosting algorithm with practical advantages over earlier boosting algorithms.

Winner of the 2004 ACM Paris Kanellakis Award: "For their seminal work and distinguished contributions [...] to the development of the theory and practice of boosting, a general and provably effective method of producing arbitrarily accurate prediction rules by combining weak learning rules"; specifically, for AdaBoost, which can be used to significantly reduce the error of algorithms used in statistical analysis, spam filtering, fraud detection, optical character recognition, and market segmentation, among other applications.
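The following is a minimal Python sketch of the generic boosting loop in the "Basic idea" above, under the assumption that step 1 is realized by resampling from a weight distribution over S (the slide allows either a subset or a distribution); the names boost and weak_learner are illustrative, not from the lecture.

import random

def boost(S, weak_learner, T, seed=0):
    """Generic boosting loop: repeatedly call a weak learner on
    (re)weighted versions of the training data S = [(x, y), ...]."""
    rng = random.Random(seed)
    weights = [1.0 / len(S)] * len(S)       # start from a uniform distribution over S
    classifiers = []
    for t in range(T):
        # Step 1: choose S_t (here: a weighted resample of S).
        S_t = rng.choices(S, weights=weights, k=len(S))
        # Step 2: get a rough rule-of-thumb from the weak learner.
        f_t = weak_learner(S_t)
        classifiers.append(f_t)
        # A concrete boosting algorithm (e.g., AdaBoost, next) would now
        # update `weights` based on which examples f_t gets wrong.
    # Return an ensemble: here, an unweighted majority vote over f_1, ..., f_T.
    def ensemble(x):
        votes = sum(f(x) for f in classifiers)
        return 1 if votes > 0 else -1
    return ensemble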

AdaBoost

input: Training data {(x_i, y_i)}_{i=1}^n from X × {−1, +1}.
1: Initialize D_1(i) := 1/n for each i = 1, 2, ..., n (a probability distribution).
2: for t = 1, 2, ..., T do
3:   Give D_t-weighted examples to WL; get back f_t : X → {−1, +1}.
4:   Update weights:
       z_t := Σ_{i=1}^n D_t(i) y_i f_t(x_i) ∈ [−1, 1]
       α_t := (1/2) ln((1 + z_t)/(1 − z_t)) ∈ R   (weight of f_t)
       D_{t+1}(i) := D_t(i) exp(−α_t y_i f_t(x_i)) / Z_t for each i = 1, 2, ..., n,
       where Z_t > 0 is the normalizer that makes D_{t+1} a probability distribution.
5: end for
6: return Final classifier f̂(x) := sign(Σ_{t=1}^T α_t f_t(x)).

Interpreting z_t

Suppose (X, Y) ∼ D_t. If P(f(X) = Y) = 1/2 + γ_t, then
    z_t = Σ_{i=1}^n D_t(i) y_i f(x_i) = 2γ_t ∈ [−1, 1].
z_t = 0: random guessing w.r.t. D_t.
z_t > 0: better than random guessing w.r.t. D_t.
z_t < 0: better off using the opposite of f's predictions.
(Let sign(z) := +1 if z > 0 and sign(z) := −1 if z ≤ 0.)

Interpretation

Classifier weights: α_t = (1/2) ln((1 + z_t)/(1 − z_t)). [Plot of α_t as a function of z_t ∈ (−1, 1).]
Example weights: D_{t+1}(i) ∝ D_t(i) exp(−α_t y_i f_t(x_i)).

Example: AdaBoost with decision stumps

Weak learning algorithm WL: ERM with F = decision stumps on R^2 (i.e., axis-aligned threshold functions x ↦ sign(v(x_i − t)) for a coordinate i, direction v ∈ {−1, +1}, and threshold t). It is straightforward to handle importance weights in ERM.
(Example from Figures 1.1 and 1.2 of Schapire & Freund text.)
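Below is a direct, unoptimized Python sketch of the AdaBoost pseudocode and the decision-stump weak learner above; the helper names (adaboost, weighted_stump) are ours, and the stump search is a brute-force weighted ERM over all observed thresholds, meant only for illustration.

import math

def weighted_stump(X, y, D):
    """Weighted ERM over decision stumps x -> sign(v * (x[j] - t)):
    try every feature j, every observed threshold t, and both directions v,
    and keep the stump with the smallest D-weighted error."""
    n, d = len(X), len(X[0])
    best = None
    for j in range(d):
        for t in sorted({x[j] for x in X}):
            for v in (+1, -1):
                err = sum(D[i] for i in range(n)
                          if (1 if v * (X[i][j] - t) > 0 else -1) != y[i])
                if best is None or err < best[0]:
                    best = (err, j, t, v)
    _, j, t, v = best
    return lambda x, j=j, t=t, v=v: 1 if v * (x[j] - t) > 0 else -1

def adaboost(X, y, T):
    """AdaBoost as in the pseudocode: edge z_t, weight alpha_t, and the
    multiplicative update of the distribution D_t over examples."""
    n = len(X)
    D = [1.0 / n] * n                                          # D_1(i) = 1/n
    fs, alphas = [], []
    for t in range(T):
        f_t = weighted_stump(X, y, D)                          # weak learner on D_t-weighted data
        z_t = sum(D[i] * y[i] * f_t(X[i]) for i in range(n))   # z_t in [-1, 1]
        z_t = max(min(z_t, 1 - 1e-12), -1 + 1e-12)             # avoid division by zero at |z_t| = 1
        alpha_t = 0.5 * math.log((1 + z_t) / (1 - z_t))        # weight of f_t
        D = [D[i] * math.exp(-alpha_t * y[i] * f_t(X[i])) for i in range(n)]
        Z_t = sum(D)                                           # normalizer
        D = [w / Z_t for w in D]                               # D_{t+1} is a probability distribution
        fs.append(f_t)
        alphas.append(alpha_t)
    return lambda x: 1 if sum(a * f(x) for a, f in zip(alphas, fs)) > 0 else -1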

Example: execution of AdaBoost

[Panels showing the distributions D_1, D_2, D_3 and the chosen stumps f_1, f_2, f_3.]
z_1 = 0.40, α_1 = 0.42;  z_2 = 0.58, α_2 = 0.65;  z_3 = 0.72, α_3 = 0.92.

Example: final classifier from AdaBoost

Final classifier: f̂(x) = sign(0.42 f_1(x) + 0.65 f_2(x) + 0.92 f_3(x)). (Zero training error rate!)

Empirical results: UCI

Test error rates of C4.5 and AdaBoost on several classification problems. Each point represents a single classification problem/dataset from the UCI repository. [Scatter plots: test error of C4.5 vs. boosting stumps, and test error of C4.5 vs. boosting C4.5.]
C4.5 = a popular algorithm for learning decision trees.
(Figure 1.3 from Schapire & Freund text.)

Training error rate of the final classifier

Recall γ_t := P(f_t(X) = Y) − 1/2 = z_t/2 when (X, Y) ∼ D_t.
Training error rate of the final classifier from AdaBoost:
    err(f̂, {(x_i, y_i)}_{i=1}^n) ≤ exp(−2 Σ_{t=1}^T γ_t²).
If the average γ̄² := (1/T) Σ_{t=1}^T γ_t² > 0, then the training error rate is at most exp(−2 γ̄² T).
AdaBoost = Adaptive Boosting: some γ_t could be small, even negative; only the overall average γ̄² matters.
What about the true error rate?
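A quick numeric check of the training error bound (our own, not from the slides), plugging in the edges z_t from the three-round example above and using γ_t = z_t/2 as defined on the slide; it also shows how the bound exp(−2 γ̄² T) decays when the weak learner keeps a constant edge.

import math

z = [0.40, 0.58, 0.72]                  # edges z_t from the worked example
gammas = [z_t / 2 for z_t in z]         # gamma_t = z_t / 2
bound = math.exp(-2 * sum(g * g for g in gammas))
print(f"bound after T=3 rounds: {bound:.3f}")   # ~0.60 -- loose here (actual training error is 0)

# With a constant edge gamma = 0.1, the bound exp(-2 * gamma**2 * T)
# halves roughly every 35 rounds:
for T in (50, 100, 200, 400):
    print(T, math.exp(-2 * 0.1**2 * T))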

Combining classifiers

Let F be the function class used by the weak learning algorithm WL. The function class used by AdaBoost is
    F_T := { x ↦ sign(Σ_{t=1}^T α_t f_t(x)) : f_1, f_2, ..., f_T ∈ F, α_1, α_2, ..., α_T ∈ R },
i.e., linear combinations of T functions from F. The complexity of F_T grows linearly with T.

Theoretical guarantee (e.g., when F = decision stumps in R^d): with high probability (over the random choice of the training sample),
    err(f̂) ≤ exp(−2 γ̄² T) + O(√(T log(d) / n)),
where the first term bounds the training error rate and the second is the error due to the finite sample.

Theory suggests a danger of overfitting when T is very large. Indeed, this does happen sometimes... but often not!

A typical run of boosting

AdaBoost + C4.5 on the "letters" dataset. [Plot of error rate vs. number of rounds T, from 10 to 1000: C4.5 test error, AdaBoost test error, AdaBoost training error.]
(The number of nodes across all decision trees in f̂ is > 2 × 10^6.)
The training error rate is zero after just five rounds, but the test error rate continues to decrease, even up to 1000 rounds!
(Figure 1.7 from Schapire & Freund text.)

Boosting the margin

Final classifier from AdaBoost:
    f̂(x) = sign(g(x)),  where g(x) := Σ_{t=1}^T α_t f_t(x) / Σ_{t=1}^T α_t ∈ [−1, 1].
Call y·g(x) ∈ [−1, 1] the margin achieved on example (x, y).
New theory [Schapire, Freund, Bartlett, and Lee, 1998]: larger margins give better resistance to overfitting, independent of T. AdaBoost tends to increase the margins on training examples. (Similar to, but not the same as, SVM margins.)

On the letters dataset:
                          T = 5     T = 100    T = 1000
    training error rate    0.0%      0.0%       0.0%
    test error rate        8.4%      3.3%       3.1%
    % margins ≤ 0.5        7.7%      0.0%       0.0%
    minimum margin         0.14      0.52       0.55

Linear classifiers

Regard the function class F used by the weak learning algorithm as feature functions:
    x ↦ φ(x) := (f(x) : f ∈ F) ∈ {−1, +1}^F   (possibly infinite dimensional!).
AdaBoost's final classifier is a linear classifier over {−1, +1}^F:
    f̂(x) = sign(Σ_{t=1}^T α_t f_t(x)) = sign(Σ_{f ∈ F} w_f f(x)) = sign(⟨w, φ(x)⟩),
where w_f := Σ_{t=1}^T α_t 1{f_t = f} for each f ∈ F.
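A small helper (names ours) for the margin quantities tabulated above, given lists fs and alphas of weak classifiers and their weights (the quantities accumulated inside the AdaBoost sketch earlier); margins() returns the normalized margins y·g(x), and margin_summary() reports the fraction of examples with margin ≤ 0.5 and the minimum margin.

def margins(X, y, fs, alphas):
    """Normalized margins y * g(x) in [-1, 1], where
    g(x) = sum_t alpha_t * f_t(x) / sum_t alpha_t."""
    total = sum(alphas)
    return [y_i * sum(a * f(x_i) for a, f in zip(alphas, fs)) / total
            for x_i, y_i in zip(X, y)]

def margin_summary(ms):
    """Fraction of training examples with margin <= 0.5, and the minimum margin."""
    frac_small = sum(m <= 0.5 for m in ms) / len(ms)
    return frac_small, min(ms)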

Exponential loss

AdaBoost is a particular coordinate descent algorithm (similar to, but not the same as, gradient descent) for
    min_{w ∈ R^F} (1/n) Σ_{i=1}^n exp(−y_i ⟨w, φ(x_i)⟩).
[Plot comparing the loss functions exp(−x) and log(1 + exp(−x)).]

More on boosting

Many variants of boosting:
- AdaBoost with different loss functions.
- Boosted decision trees = boosting + decision trees.
- Boosting algorithms for ranking and multi-class problems.
- Boosting algorithms that are robust to certain kinds of noise.
- Boosting for online learning algorithms (very new!).
- ...
Many connections between boosting and other subjects:
- Game theory, online learning.
- Geometry of information (replace the squared ℓ₂ distance with the relative entropy divergence).
- Computational complexity.
- ...

Application: face detection

Problem: Given an image, locate all of the faces. An application of AdaBoost.
As a classification problem:
- Divide images into patches (at varying scales, e.g., 24×24, 48×48).
- Learn a classifier f : X → Y, where Y = {face, not face}.
Many other things are built on top of face detectors (e.g., face tracking, face recognizers); they are now in every digital camera and in iPhoto/Picasa-like software.
Main problem: how to make this very fast.
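A numeric sanity check (our own) of the coordinate descent view: for a fixed weak classifier with edge z under the current example weights, the exponential loss restricted to that coordinate is minimized exactly at α = ½ ln((1 + z)/(1 − z)), which is AdaBoost's weight; the small dataset below is made up purely for illustration.

import math

def exp_loss_along_coordinate(alpha, D, correct):
    """Exponential loss (up to a positive constant factor) after adding a
    classifier with weight alpha, where correct[i] is True iff it classifies
    example i correctly and D is the current distribution over examples."""
    return sum(d * math.exp(-alpha if c else alpha) for d, c in zip(D, correct))

D = [0.1, 0.2, 0.3, 0.4]                              # current distribution D_t
correct = [True, True, False, True]                   # the new classifier errs on one example
z = sum(d if c else -d for d, c in zip(D, correct))   # edge z = sum_i D(i) y_i f(x_i) = 0.4
alpha_star = 0.5 * math.log((1 + z) / (1 - z))        # AdaBoost's closed-form weight

# alpha_star beats nearby values of alpha along this coordinate:
for a in (alpha_star - 0.1, alpha_star, alpha_star + 0.1):
    print(round(a, 3), round(exp_loss_along_coordinate(a, D, correct), 5))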

Face detectors via AdaBoost [Viola & Jones, 2001]

Face detector architecture by Viola & Jones (2001): a major achievement in computer vision; the detector is actually usable in real time.
Think of each image patch (d×d-pixel gray-scale) as a vector x ∈ [0, 1]^{d²}.
They used a weak learning algorithm that picks linear classifiers f_{w,t}(x) = sign(⟨w, x⟩ − t), where w has a very particular form (rectangular black/white box patterns).
AdaBoost combines many rules-of-thumb of this form. Very simple. Extremely fast to evaluate via pre-computation.

Viola & Jones integral image trick

For every image, pre-compute
    s(r, c) = sum of the pixel values in the rectangle from (0, 0) to (r, c)
in a single pass through the image.
To compute the inner product
    ⟨w, x⟩ = (average pixel value in the black box) − (average pixel value in the white box),
we just need to add and subtract a few s(r, c) values.
Evaluating these rule-of-thumb classifiers is extremely fast.

Viola & Jones cascade architecture

Problem: severe class imbalance (most patches don't contain a face).
Solution: Train several classifiers (each using AdaBoost), and arrange them in a special kind of decision list called a cascade:
    x → f^(1)(x) —(+1)→ f^(2)(x) —(+1)→ f^(3)(x) —(+1)→ ...,
where an output of −1 at any stage rejects the patch.
Each f^(l) is trained (using AdaBoost); its threshold (before passing through the sign) is adjusted to minimize the false negative rate.
Can make f^(l) in later stages more complex than in earlier stages, since most examples don't make it to the end.
(Cascade) classifier evaluation is extremely fast.

Viola & Jones detector: example results

[Example detection results omitted.]
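A short Python sketch (function names ours) of the integral image trick: one pass builds s(r, c), after which any rectangle sum takes four lookups; two_rect_feature mirrors the "average in the black box minus average in the white box" rule-of-thumb described above.

def integral_image(img):
    """s[r][c] = sum of img over the rectangle from (0, 0) to (r, c), inclusive."""
    R, C = len(img), len(img[0])
    s = [[0.0] * C for _ in range(R)]
    for r in range(R):
        row_sum = 0.0
        for c in range(C):
            row_sum += img[r][c]                       # running sum of row r up to column c
            s[r][c] = row_sum + (s[r - 1][c] if r > 0 else 0.0)
    return s

def box_sum(s, r0, c0, r1, c1):
    """Sum of pixels in the rectangle [r0, r1] x [c0, c1], via four lookups."""
    total = s[r1][c1]
    if r0 > 0:
        total -= s[r0 - 1][c1]
    if c0 > 0:
        total -= s[r1][c0 - 1]
    if r0 > 0 and c0 > 0:
        total += s[r0 - 1][c0 - 1]
    return total

def two_rect_feature(s, black, white):
    """Average pixel value in the black box minus average in the white box;
    each box is given as (r0, c0, r1, c1)."""
    def avg(box):
        r0, c0, r1, c1 = box
        return box_sum(s, r0, c0, r1, c1) / ((r1 - r0 + 1) * (c1 - c0 + 1))
    return avg(black) - avg(white)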

Summary

Two key points:
- AdaBoost effectively combines many fast-to-evaluate weak classifiers.
- The cascade structure optimizes speed for the common case.

Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).
Input: training data {(x_i, y_i)}_{i=1}^n from X × {−1, +1}.
For t = 1, 2, ..., T:
  1. Randomly pick n examples with replacement from the training data, giving {(x_i^(t), y_i^(t))}_{i=1}^n (a bootstrap sample).
  2. Run the learning algorithm on {(x_i^(t), y_i^(t))}_{i=1}^n to get a classifier f_t.
Return a majority vote classifier over f_1, f_2, ..., f_T.

Aside: sampling with replacement

Question: if n individuals are picked from a population of size n uniformly at random with replacement, what is the probability that a given individual is not picked?
Answer: (1 − 1/n)^n. For large n,
    lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.3679.
Implications for bagging (see the simulation sketch below):
- Each bootstrap sample contains about 63% of the data set.
- The remaining 37% can be used to estimate the error rate of the classifier trained on that bootstrap sample.
- Can average across bootstrap samples to get an estimate of the bagged classifier's error rate (sort of).
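A quick simulation (our own) of the sampling-with-replacement fact above, plus a minimal bagging sketch; bootstrap_coverage estimates the fraction of distinct training points that land in a bootstrap sample, which concentrates near 1 − 1/e ≈ 0.632, and bagging takes any base learner and returns the majority-vote classifier.

import random

def bootstrap_coverage(n, trials=1000, seed=0):
    """Average fraction of the n indices that appear in a bootstrap sample of size n."""
    rng = random.Random(seed)
    cov = 0.0
    for _ in range(trials):
        sample = [rng.randrange(n) for _ in range(n)]   # n draws with replacement
        cov += len(set(sample)) / n                     # fraction of distinct indices hit
    return cov / trials

print(bootstrap_coverage(1000))   # ~0.632: about 63% in the sample, 37% "out of bag"

def bagging(S, learner, T, seed=0):
    """Bagging: train `learner` on T bootstrap samples of S = [(x, y), ...]
    and return the majority-vote classifier."""
    rng = random.Random(seed)
    fs = [learner([S[rng.randrange(len(S))] for _ in range(len(S))]) for _ in range(T)]
    return lambda x: 1 if sum(f(x) for f in fs) > 0 else -1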

Random Forests

Random Forests (Leo Breiman, 2001).
Input: training data {(x_i, y_i)}_{i=1}^n from R^d × {−1, +1}.
For t = 1, 2, ..., T:
  1. Randomly pick n examples with replacement from the training data, giving {(x_i^(t), y_i^(t))}_{i=1}^n (a bootstrap sample).
  2. Run a variant of the decision tree learning algorithm on {(x_i^(t), y_i^(t))}_{i=1}^n, where each split is chosen by considering only a random subset of the d features (rather than all d features), to get a decision tree classifier f_t.
Return a majority vote classifier over f_1, f_2, ..., f_T.
(A rough code sketch appears after the key takeaways below.)

Key takeaways

1. The theoretical concepts of weak and strong learning.
2. The AdaBoost algorithm; the concept of margins in boosting.
3. Interpreting AdaBoost's final classifier as a linear classifier, and interpreting AdaBoost as a coordinate descent algorithm.
4. The structure of decision lists / cascades.
5. The concept of bootstrap samples; bagging and random forests.
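A rough Python sketch of the Random Forests loop above; train_tree is a hypothetical stand-in (not a real library call) for a decision tree learner that, at each split, draws a fresh random subset of k of the d features and only considers splits on those, and the choice k ≈ √d is a common default that we assume here rather than something stated on the slide.

import math
import random

def random_forest(S, d, train_tree, T, seed=0):
    """Random Forests sketch: T bootstrap samples, each fit with a
    feature-subsampling decision tree learner, combined by majority vote.
    `train_tree(sample, k)` is a hypothetical helper assumed to grow a tree
    that considers only a fresh random k-subset of the d features at each split."""
    rng = random.Random(seed)
    k = max(1, math.isqrt(d))        # size of the per-split feature subset (assumed default)
    trees = []
    for _ in range(T):
        sample = [S[rng.randrange(len(S))] for _ in range(len(S))]   # bootstrap sample
        trees.append(train_tree(sample, k))
    return lambda x: 1 if sum(tree(x) for tree in trees) > 0 else -1  # majority vote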