Learning theory. Ensemble methods. Boosting. Boosting: history

Save this PDF as:

Size: px
Start display at page:

Download "Learning theory. Ensemble methods. Boosting. Boosting: history"

Transcription

1 Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over random choice of S), learn a classifier ˆf : X { 1, 1} with low error rate err( ˆf) = P ( ˆf(X) Y ) ɛ. Basic question: When is this possible? Suppose I even promise you that there is a perfect classifier from a particular function class F. (E.g., F = linear classifiers or F = decision trees.) Default: Empirical Risk Minimization (i.e., pick classifier from F with lowest training error rate), but this might be computationally difficult (e.g., for decision trees). Another question: Is it easier to learn just non-trivial classifiers in F (i.e., better than random guessing)? 1 / 30 2 / 30 Boosting Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. Motivation: Easy to construct classification rules that are correct more-often-than-not (e.g., If % of the characters are dollar signs, then it s spam. ), but seems hard to find a single rule that is almost always correct. Basic idea: Input: training data S For t = 1, 2,..., T : 1. Choose subset of examples S t S (or a distribution over S). 2. Use weak learning algorithm to get classifier: f t := WL(S t ). Return an ensemble classifier based on f 1, f 2,..., f T. Boosting: history 1984 Valiant and Kearns ask whether boosting is theoretically possible (formalized in the PAC learning model) Schapire creates first boosting algorithm, solving the open problem of Valiant and Kearns Freund creates an optimal boosting algorithm (Boost-by-majority) Drucker, Schapire, and Simard empirically observe practical limitations of early boosting algorithms. 199 Freund and Schapire create AdaBoost a boosting algorithm with practical advantages over early boosting algorithms. Winner of 2004 ACM Paris Kanellakis Award: For their seminal work and distinguished contributions [...] to the development of the theory and practice of boosting, a general and provably effective method of producing arbitrarily accurate prediction rules by combining weak learning rules ; specifically, for AdaBoost, which can be used to significantly reduce the error of algorithms used in statistical analysis, spam filtering, fraud detection, optical character recognition, and market segmentation, among other applications. 3 / 30 4 / 30

2 AdaBoost Interpretation input Training data {(x i, y i )} n i=1 from X { 1, 1}. 1: initialize D 1 (i) := 1/n for each i = 1, 2,..., n (a probability distribution). 2: for t = 1, 2,..., T do 3: Give D t -weighted examples to WL; get back f t : X { 1, 1}. 4: Update weights: z t := n D t (i) y i f t (x i ) [ 1, 1] i=1 α t := 1 2 ln 1 z t 1 z t R (weight of f t ) D t1 (i) := D t (i) exp ( α t y i f t (x i ) ) /Z t for each i = 1, 2,..., n, where Z t > 0 is normalizer that makes D t1 a probability distribution. : end for 6: return Final classifier ˆf(x) := sign α t f t (x). Interpreting z t Suppose (X, Y ) D t. If then z t = P (f(x) = Y ) = 1 2 γ t, n D t (i) y i f(x i ) = 2γ t [ 1, 1]. i=1 z t = 0 random guessing w.r.t. D t. z t > 0 better than random guessing w.r.t. D t. z t < 0 better off using the opposite of f s predictions. (Let sign(z) := 1 if z > 0 and sign(z) := 1 if z 0.) / 30 6 / 30 Interpretation Example: AdaBoost with decision stumps Classifier weights α t = 1 2 ln 1z t 1 z t α t z t Example weights D t1 (i) D t1 (i) D t (i) exp( α t y i f t (x i )). Weak learning algorithm WL: ERM with F = decision stumps on R 2 (i.e., axis-aligned threshold functions x sign(vx i t)). Straightforward to handle importance weights in ERM. (Example from Figures 1.1 and 1.2 of Schapire & Freund text.) 7 / 30 8 / 30

3 Example: execution of AdaBoost Example: final classifier from AdaBoost D 1 D 2 D 3 f 1 f 2 f 3 z 1 = 0.40, α 1 = 0.42 z 2 = 0.8, α 2 = 0.6 z 3 = 0.72, α 3 = 0.92 Final classifier ˆf(x) = sign(0.42f 1 (x) 0.6f 2 (x) 0.92f 3 (x)) f 1 f 2 f 3 z 1 = 0.40, α 1 = 0.42 z 2 = 0.8, α 2 = 0.6 z 3 = 0.72, α 3 = 0.92 (Zero training error rate!) 9 / / 30 Empirical results UCI UCI Results Test error rates of C4. and AdaBoost on several classification problems. Each point represents a single classification problem/dataset from UCI repository. Training error rate of final classifier Recall γ t := P (f t (X) = Y ) 1/2 = z t /2 when (X, Y ) D t. C4. C C4. C AdaBooststumps AdaBoostC4. boosting Stumps boosting C4. C4. C4. = popular algorithm for learning decision trees. Training error rate of final classifier from AdaBoost: err( ˆf, {(x i, y i )} n i=1) exp 2 If average γ 2 ( ) := 1 T T γ2 t > 0, then training error rate is exp 2 γ 2 T. AdaBoost = Adaptive Boosting Some γ t could be small, even negative only care about overall average γ 2. What about true error rate? γ 2 t. (Figure 1.3 from Schapire & Freund text.) 11 / / 30

4 Combining classifiers Let F be the function class used by the weak learning algorithm WL. The function class used by AdaBoost is F T := x sign α t f t (x) : f 1, f 2,..., f T F, α 1, α 2,..., α T R i.e., linear combinations of T functions from F. Complexity of F T grows linearly with T. Theoretical guarantee (e.g., when F = decision stumps in R d ): With high probability (over random choice of training sample), err( ˆf) ( ) ( ) exp 2 γ 2 T log d T O. }{{} n }{{} Training error rate Error due to finite sample Theory suggests danger of overfitting when T is very large. Indeed, this does happen sometimes... but often not! A typical run of boosting AdaBoostC4. on letters dataset. Error rate C4. test error AdaBoost test error AdaBoost training error # of rounds T (# nodes across all decision trees in ˆf is > ) Training error rate is zero after just five rounds, but test error rate continues to decrease, even up to 1000 rounds! (Figure 1.7 from Schapire & Freund text) 13 / / 30 Boosting the margin Final classifier from AdaBoost: ( T ) ˆf(x) = sign α tf t (x) T α. t }{{} g(x) [ 1, 1] Call y g(x) [ 1, 1] the margin achieved on example (x, y). New theory [Schapire, Freund, Bartlett, and Lee, 1998]: Larger margins better resistance to overfitting, independent of T. AdaBoost tends to increase margins on training examples. (Similar but not the same as SVM margins.) On letters dataset: T = T = 100 T = 1000 training error rate 0.0% 0.0% 0.0% test error rate 8.4% 3.3% 3.1% % margins % 0.0% 0.0% min. margin Linear classifiers Regard function class F used by weak learning algorithm as feature functions : x φ(x) := ( f(x) : f F ) { 1, 1} F (possibly infinite dimensional!). AdaBoost s final classifier is a linear classifier in { 1, 1} F : ˆf(x) = sign α t f t (x) = sign w f f(x) = sign ( w, φ(x) ) f F where w f := α t 1{f t = f} f F. 1 / / 30

5 Exponential loss AdaBoost is a particular coordinate descent algorithm (similar to but not the same as gradient descent) for 1 min w R F n n exp ( y i w, φ(x i ) ). i=1 exp(-x) log(1exp(-x)) More on boosting Many variants of boosting: AdaBoost with different loss functions. Boosted decision trees = boosting decision trees. Boosting algorithms for ranking and multi-class. Boosting algorithms that are robust to certain kinds of noise. Boosting for online learning algorithms (very new!).... Many connections between boosting and other subjects: Game theory, online learning Geometry of information (replace 2 2 with relative entropy divergence) Computational complexity / / 30 Application: face detection Face detection Problem: Given an image, locate all of the faces. An application of AdaBoost As a classification problem: Divide up images into patches (at varying scales, e.g., 24 24, 48 48). Learn classifier f : X Y, where Y = {face, not face}. Many other things built on top of face detectors (e.g., face tracking, face recognizers); now in every digital camera and iphoto/picasa-like software. Main problem: how to make this very fast. 20 / 30

6 Face detectors via AdaBoost [Viola & Jones, 2001] Face detector architecture by Viola & Jones (2001): major achievement in computer vision; detector actually usable in real-time. Think of each image patch (d d-pixel gray-scale) as a vector x [0, 1] d2. Used weak learning algorithm that picks linear classifiers f w,t (x) = sign( w, x t), where w has a very particular form: Viola & Jones integral image trick Integral image trick: (0, 0) For every image, pre-compute (r, c) s(r, c) = sum of pixel values in rectangle from (0, 0) to (r, c) (single pass through image). To compute inner product w, x = average pixel value in black box average pixel value in white box just need to add and subtract a few s(r, c) values. AdaBoost combines many rules-of-thumb of this form. Very simple. Extremely fast to evaluate via pre-computation. Evaluating rules-of-thumb classifiers is extremely fast. 21 / / 30 Viola & Jones cascade architecture Viola & Jones detector: example results Problem: severe class imbalance (most patches don t contain a face). Solution: Train several classifiers (each using AdaBoost), and arrange in a special kind of decision list called a cascade: x f (1) 1 (x) f (2) 1 (x) f (3) (x) Each f (l) is trained (using AdaBoost), adjust threshold (before passing through sign) to minimize false negative rate. Can make f (l) in later stages more complex than in earlier stages, since most examples don t make it to the end. (Cascade) classifier evaluation extremely fast. 23 / / 30

7 Summary Two key points: AdaBoost effectively combines many fast-to-evaluate weak classifiers. Cascade structure optimizes speed for common case. Bagging 2 / 30 Bagging Aside: sampling with replacement Question: if n individuals are picked from a population of size n u.a.r. with replacement, what is the probability that a given individual is not picked? Bagging = Bootstrap aggregating (Leo Breiman, 1994). Input: training data {(x i, y i )} n i=1 from X { 1, 1}. For t = 1, 2,..., T : 1. Randomly pick n examples with replacement from training data {(x (t) i, y (t) i )} n i=1 (a bootstrap sample). 2. Run learning algorithm on {(x (t) i, y (t) i )} n i=1 classifier f t. Return a majority vote classifier over f 1, f 2,..., f T. Answer: ( 1 1 ) n n For large n: Implications for bagging: ( lim 1 1 ) n = 1 n n e Each bootstrap sample contains about 63% of the data set. Remaining 37% can be used to estimate error rate of classifier trained on the bootstrap sample. Can average across bootstrap samples to get estimate of bagged classifier s error rate (sort of). 27 / / 30

8 Random Forests Key takeaways Random Forests (Leo Breiman, 2001). Input: training data {(x i, y i )} n i=1 from R d { 1, 1}. For t = 1, 2,..., T : 1. Randomly pick n examples with replacement from training data {(x (t) i, y (t) i )} n i=1 (a bootstrap sample). 2. Run variant of decision tree learning algorithm on {(x (t) i, y (t) i )} n i=1, where each split is chosen by only considering a random subset of d features (rather than all d features) decision tree classifier f t. Return a majority vote classifier over f 1, f 2,..., f T. 1. Theoretical concept of weak and strong learning. 2. AdaBoost algorithm; concept of margins in boosting. 3. Interpreting AdaBoost s final classifier as a linear classifier, and interpreting AdaBoost as a coordinate descent algorithm. 4. Structure of decision lists / cascades.. Concept of bootstrap samples; bagging and random forests. 29 / / 30

COMS 4771 Lecture Boosting 1 / 16

COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?

More information

Boosting: Foundations and Algorithms. Rob Schapire

Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you

More information

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University BOOSTING Robert E. Schapire and Yoav

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Hierarchical Boosting and Filter Generation

January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

The Boosting Approach to. Machine Learning. Maria-Florina Balcan 10/31/2016

The Boosting Approach to Machine Learning Maria-Florina Balcan 10/31/2016 Boosting General method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets

More information

Background. Adaptive Filters and Machine Learning. Bootstrap. Combining models. Boosting and Bagging. Poltayev Rassulzhan

Adaptive Filters and Machine Learning Boosting and Bagging Background Poltayev Rassulzhan rasulzhan@gmail.com Resampling Bootstrap We are using training set and different subsets in order to validate results

More information

Ensemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12

Ensemble Methods Charles Sutton Data Mining and Exploration Spring 2012 Bias and Variance Consider a regression problem Y = f(x)+ N(0, 2 ) With an estimate regression function ˆf, e.g., ˆf(x) =w > x Suppose

More information

Voting (Ensemble Methods)

1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers

More information

Outline: Ensemble Learning. Ensemble Learning. The Wisdom of Crowds. The Wisdom of Crowds - Really? Crowd wiser than any individual

Outline: Ensemble Learning We will describe and investigate algorithms to Ensemble Learning Lecture 10, DD2431 Machine Learning A. Maki, J. Sullivan October 2014 train weak classifiers/regressors and how

More information

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Boosting. Acknowledgment Slides are based on tutorials from Robert Schapire and Gunnar Raetsch

. Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

More information

Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

Machine Learning, Fall 2011: Homework 5

0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Ensemble Methods for Machine Learning

Ensemble Methods for Machine Learning COMBINING CLASSIFIERS: ENSEMBLE APPROACHES Common Ensemble classifiers Bagging/Random Forests Bucket of models Stacking Boosting Ensemble classifiers we ve studied

More information

Ensemble learning 11/19/13. The wisdom of the crowds. Chapter 11. Ensemble methods. Ensemble methods

The wisdom of the crowds Ensemble learning Sir Francis Galton discovered in the early 1900s that a collection of educated guesses can add up to very accurate predictions! Chapter 11 The paper in which

More information

The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m

) Set W () i The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m m =.5 1 n W (m 1) i y i h(x i ; 2 ˆθ

More information

CS 231A Section 1: Linear Algebra & Probability Review

CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability

More information

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi

Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier

More information

Stochastic Gradient Descent

Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement

More information

Big Data Analytics. Special Topics for Computer Science CSE CSE Feb 24

Big Data Analytics Special Topics for Computer Science CSE 4095-001 CSE 5095-005 Feb 24 Fei Wang Associate Professor Department of Computer Science and Engineering fei_wang@uconn.edu Prediction III Goal

More information

Decision trees COMS 4771

Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Decision trees COMS 4771

Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning

More information

VBM683 Machine Learning

VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

Boosting: Algorithms and Applications

Boosting: Algorithms and Applications Lecture 11, ENGN 4522/6520, Statistical Pattern Recognition and Its Applications in Computer Vision ANU 2 nd Semester, 2008 Chunhua Shen, NICTA/RSISE Boosting Definition

More information

Optimal and Adaptive Algorithms for Online Boosting

Optimal and Adaptive Algorithms for Online Boosting Alina Beygelzimer 1 Satyen Kale 1 Haipeng Luo 2 1 Yahoo! Labs, NYC 2 Computer Science Department, Princeton University Jul 8, 2015 Boosting: An Example

More information

Boosting & Deep Learning

Boosting & Deep Learning Ensemble Learning n So far learning methods that learn a single hypothesis, chosen form a hypothesis space that is used to make predictions n Ensemble learning à select a collection

More information

AdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague

AdaBoost Lecturer: Jan Šochman Authors: Jan Šochman, Jiří Matas Center for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Motivation Presentation 2/17 AdaBoost with trees

More information

2 Upper-bound of Generalization Error of AdaBoost

COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #10 Scribe: Haipeng Zheng March 5, 2008 1 Review of AdaBoost Algorithm Here is the AdaBoost Algorithm: input: (x 1,y 1 ),...,(x m,y

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#\$

More information

Recitation 9. Gradient Boosting. Brett Bernstein. March 30, CDS at NYU. Brett Bernstein (CDS at NYU) Recitation 9 March 30, / 14

Brett Bernstein CDS at NYU March 30, 2017 Brett Bernstein (CDS at NYU) Recitation 9 March 30, 2017 1 / 14 Initial Question Intro Question Question Suppose 10 different meteorologists have produced functions

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 8: Boosting (and Compression Schemes) Boosting the Error If we have an efficient learning algorithm that for any distribution

More information

Gradient Boosting (Continued)

Gradient Boosting (Continued) David Rosenberg New York University April 4, 2016 David Rosenberg (New York University) DS-GA 1003 April 4, 2016 1 / 31 Boosting Fits an Additive Model Boosting Fits an Additive

More information

Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Boosting Mehryar Mohri - Introduction to Machine Learning page 2 Boosting Ideas Main idea:

More information

Learning with multiple models. Boosting.

CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

I D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69

R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual

More information

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity

What makes good ensemble? CS789: Machine Learning and Neural Network Ensemble methods Jakramate Bootkrajang Department of Computer Science Chiang Mai University 1. A member of the ensemble is accurate.

More information

Generalized Boosted Models: A guide to the gbm package

Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

TDT4173 Machine Learning

TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

Ensembles of Classifiers.

Ensembles of Classifiers www.biostat.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts ensemble bootstrap sample bagging boosting random forests error correcting

More information

Boos\$ng Can we make dumb learners smart?

Boos\$ng Can we make dumb learners smart? Aarti Singh Machine Learning 10-601 Nov 29, 2011 Slides Courtesy: Carlos Guestrin, Freund & Schapire 1 Why boost weak learners? Goal: Automa'cally categorize type

More information

Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13

Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions

CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions f (x) = M c m 1(x R m ) m=1 with R 1,..., R m R p disjoint. The CART algorithm is a heuristic, adaptive

More information

Announcements Kevin Jamieson

Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium

More information

Boosting Methods for Regression

Machine Learning, 47, 153 00, 00 c 00 Kluwer Academic Publishers. Manufactured in The Netherlands. Boosting Methods for Regression NIGEL DUFFY nigeduff@cse.ucsc.edu DAVID HELMBOLD dph@cse.ucsc.edu Computer

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

More information

Advanced Machine Learning

Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:

More information

Deep Boosting. Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) COURANT INSTITUTE & GOOGLE RESEARCH.

Deep Boosting Joint work with Corinna Cortes (Google Research) Umar Syed (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Ensemble Methods in ML Combining several base classifiers

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

A Brief Introduction to Adaboost

A Brief Introduction to Adaboost Hongbo Deng 6 Feb, 2007 Some of the slides are borrowed from Derek Hoiem & Jan ˇSochman. 1 Outline Background Adaboost Algorithm Theory/Interpretations 2 What s So Good

More information

PDEEC Machine Learning 2016/17

PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 /

More information

Why does boosting work from a statistical view

Why does boosting work from a statistical view Jialin Yi Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 939 jialinyi@sas.upenn.edu Abstract We review boosting

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers

JMLR: Workshop and Conference Proceedings vol 35:1 8, 014 Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers Balázs Kégl LAL/LRI, University

More information

Learning Ensembles. 293S T. Yang. UCSB, 2017.

Learning Ensembles 293S T. Yang. UCSB, 2017. Outlines Learning Assembles Random Forest Adaboost Training data: Restaurant example Examples described by attribute values (Boolean, discrete, continuous)

More information

Ensemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan

Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite

More information

Name (NetID): (1 Point)

CS446: Machine Learning (D) Spring 2017 March 16 th, 2017 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

Support Vector Machine, Random Forests, Boosting Based in part on slides from textbook, slides of Susan Holmes. December 2, 2012

Support Vector Machine, Random Forests, Boosting Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Neural networks Neural network Another classifier (or regression technique)

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 7 Based in part on slides from textbook, slides of Susan Holmes.

Week 7 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Support Vector Machine, Random Forests, Boosting 2 / 1 Neural networks Neural network Another classifier

More information

Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

More information

18.9 SUPPORT VECTOR MACHINES

744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Lecture 13: Ensemble Methods

Lecture 13: Ensemble Methods Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu 1 / 71 Outline 1 Bootstrap

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW1 due next lecture Project details are available decide on the group and topic by Thursday Last time Generative vs. Discriminative

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

Totally Corrective Boosting Algorithms that Maximize the Margin

Totally Corrective Boosting Algorithms that Maximize the Margin Manfred K. Warmuth 1 Jun Liao 1 Gunnar Rätsch 2 1 University of California, Santa Cruz 2 Friedrich Miescher Laboratory, Tübingen, Germany

More information

Multiclass Boosting with Repartitioning

Multiclass Boosting with Repartitioning Ling Li Learning Systems Group, Caltech ICML 2006 Binary and Multiclass Problems Binary classification problems Y = { 1, 1} Multiclass classification problems Y

More information

6.036 midterm review. Wednesday, March 18, 15

6.036 midterm review 1 Topics covered supervised learning labels available unsupervised learning no labels available semi-supervised learning some labels available - what algorithms have you learned that

More information

Data Warehousing & Data Mining

13. Meta-Algorithms for Classification Data Warehousing & Data Mining Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 13.

More information

Part I Week 7 Based in part on slides from textbook, slides of Susan Holmes

Part I Week 7 Based in part on slides from textbook, slides of Susan Holmes Support Vector Machine, Random Forests, Boosting December 2, 2012 1 / 1 2 / 1 Neural networks Artificial Neural networks: Networks

More information

A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University

A Gentle Introduction to Gradient Boosting Cheng Li chengli@ccs.neu.edu College of Computer and Information Science Northeastern University Gradient Boosting a powerful machine learning algorithm it can

More information

the tree till a class assignment is reached

Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Bagging. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL 8.7

Bagging Ryan Tibshirani Data Mining: 36-462/36-662 April 23 2013 Optional reading: ISL 8.2, ESL 8.7 1 Reminder: classification trees Our task is to predict the class label y {1,... K} given a feature vector

More information

Chapter 18. Decision Trees and Ensemble Learning. Recall: Learning Decision Trees

CSE 473 Chapter 18 Decision Trees and Ensemble Learning Recall: Learning Decision Trees Example: When should I wait for a table at a restaurant? Attributes (features) relevant to Wait? decision: 1. Alternate:

More information

Machine Learning for NLP

Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Ensemble Methods: Jay Hyer

Ensemble Methods: committee-based learning Jay Hyer linkedin.com/in/jayhyer @adatahead Overview Why Ensemble Learning? What is learning? How is ensemble learning different? Boosting Weak and Strong Learners

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

BBM406 - Introduc0on to ML. Spring Ensemble Methods. Aykut Erdem Dept. of Computer Engineering HaceDepe University

BBM406 - Introduc0on to ML Spring 2014 Ensemble Methods Aykut Erdem Dept. of Computer Engineering HaceDepe University 2 Slides adopted from David Sontag, Mehryar Mohri, Ziv- Bar Joseph, Arvind Rao, Greg

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

More information