Ensemble Methods: Boosting

Ensemble Methods: Boosting. Nicholas Ruozzi, University of Texas at Dallas. Based on the slides of Vibhav Gogate and Rob Schapire.

Last Time: Variance reduction via bagging. Generate new training data sets by sampling with replacement from the empirical distribution, learn a classifier for each of the newly sampled sets, and combine the classifiers for prediction. Today: how to reduce bias.

Boosting: How to translate rules of thumb (i.e., good heuristics) into good learning algorithms. For example, if we are trying to classify email as spam or not spam, a good rule of thumb may be that emails containing "Nigerian prince" or "Viagra" are likely to be spam most of the time.

Boosting (Freund & Schapire): Theory for weak learners developed in the late 80s. A weak learner's performance on any training set is only slightly better than chance prediction. The theory was intended to answer a theoretical question, not as a practical way to improve learning. Tested in the mid 90s using not-so-weak learners: it works anyway!

PAC Learning: Given i.i.d. samples from an unknown, arbitrary distribution. A strong PAC learning algorithm: for any distribution, with high probability, given polynomially many samples (and polynomial time), it can find a classifier with arbitrarily small error. A weak PAC learning algorithm: the same, but the error only needs to be slightly better than random guessing (e.g., accuracy only needs to exceed 50% for binary classification). Does weak learnability imply strong learnability?

Boosting:
1. Weight all training samples equally
2. Train a model on the training set
3. Compute the error of the model on the training set
4. Increase the weights on the training cases the model gets wrong
5. Train a new model on the re-weighted training set
6. Re-compute the errors on the weighted training set
7. Increase the weights again on the cases the model gets wrong
Repeat until tired (100+ iterations). Final model: weighted prediction of each model (a quick off-the-shelf version of this recipe is sketched below).
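
The following is a minimal sketch of the recipe above using scikit-learn; it is not part of the lecture. It assumes scikit-learn is installed, uses depth-1 decision trees (stumps) as the weak learners, and the synthetic data is purely illustrative. Depending on the scikit-learn version, the weak-learner argument is named estimator or base_estimator.

```python
# Hedged sketch: boosting 100 decision stumps with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=100,                               # "repeat until tired (100+ iterations)"
    random_state=0,
)
booster.fit(X_tr, y_tr)
print("test accuracy:", booster.score(X_te, y_te))
```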

Boosting: Graphical Illustration. Weak classifiers h_1(x), h_2(x), …, h_M(x) are combined into the final classifier h(x) = sign(Σ_m α_m h_m(x)).

AdaBoost
1. Initialize the data weights w_1, …, w_N for the first round as w_i^(1) = 1/N.
2. For m = 1, …, M:
   a) Select a classifier h_m for the m-th round by minimizing the weighted error
      ε_m = Σ_i w_i^(m) 1[h_m(x^(i)) ≠ y^(i)],
      the weighted number of incorrect classifications of the m-th classifier.
   b) Compute
      α_m = (1/2) ln((1 − ε_m)/ε_m).
      As ε_m → 0, α_m → +∞; as ε_m → 0.5, α_m → 0; as ε_m → 1, α_m → −∞.
   c) Update the weights
      w_i^(m+1) = w_i^(m) exp(−α_m y^(i) h_m(x^(i))) / (2 √(ε_m (1 − ε_m))),
      where 2 √(ε_m (1 − ε_m)) is the normalization factor.
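
Below is a compact NumPy sketch of these updates, not from the slides: the weak learners are single-feature threshold stumps (an illustrative choice, since the slides leave the weak-learner class abstract), labels are assumed to be in {−1, +1}, and all function names are mine.

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thresh, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best  # (weighted error, feature index, threshold, sign)

def adaboost(X, y, M=50):
    """AdaBoost with threshold stumps; y must be in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                    # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):                         # step 2
        err, j, t, s = fit_stump(X, y, w)      # (a) minimize the weighted error eps_m
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # (b) alpha_m
        pred = s * np.where(X[:, j] <= t, 1, -1)
        w = w * np.exp(-alpha * y * pred)      # (c) reweight ...
        w /= w.sum()                           # ... and normalize (the slides' factor 2*sqrt(eps*(1-eps)))
        stumps.append((j, t, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Weighted vote h(x) = sign(sum_m alpha_m h_m(x))."""
    votes = sum(a * s * np.where(X[:, j] <= t, 1, -1)
                for (j, t, s), a in zip(stumps, alphas))
    return np.sign(votes)
```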

Example: Consider a classification problem where vertical and horizontal lines (and their corresponding half-spaces) are the weak learners. [Figure: the dataset D of + and − points and the stumps h_1, h_2, h_3 chosen in the three rounds.] Round 1: ε_1 = 0.3, α_1 = 0.42. Round 2: ε_2 = 0.21, α_2 = 0.65. Round 3: ε_3 = 0.14, α_3 = 0.92.

Final Hypothesis: h(x) = sign(0.42 h_1(x) + 0.65 h_2(x) + 0.92 h_3(x)). [Figure: the decision boundary of the combined classifier.]
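
As a small worked check of this weighted vote (the specific point is hypothetical, not one from the figure): a point that h_1 and h_2 label +1 but h_3 labels −1 is still classified +1, since 0.42 + 0.65 − 0.92 = 0.15 > 0.

```python
# Weighted vote for a hypothetical point: h1 and h2 say +1, h3 says -1.
import numpy as np

alphas = np.array([0.42, 0.65, 0.92])
votes = np.array([+1, +1, -1])
print(np.sign(alphas @ votes))   # 0.42 + 0.65 - 0.92 = 0.15 > 0, so the prediction is +1
```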

Boosting Theorem: Let Z_m = 2 √(ε_m (1 − ε_m)) and γ_m = 1/2 − ε_m. Then

(1/N) Σ_{i=1}^N 1[h(x^(i)) ≠ y^(i)] ≤ Π_{m=1}^M Z_m = Π_{m=1}^M √(1 − 4 γ_m²).

So, even if all of the γ's are small positive numbers (i.e., every learner is a weak learner), the training error goes to zero as M increases.
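
As a quick numeric illustration (reusing the ε values from the toy example above; the variable names are mine), the product of the Z_m already bounds the training error after just three rounds, and each additional weak round multiplies the bound by another factor strictly less than one:

```python
# Evaluate the training-error bound prod_m Z_m for the toy example's three rounds.
import numpy as np

eps = np.array([0.30, 0.21, 0.14])        # weighted errors from the example
Z = 2 * np.sqrt(eps * (1 - eps))          # Z_m = 2*sqrt(eps_m*(1-eps_m)) = sqrt(1 - 4*gamma_m^2)
print(Z, Z.prod())                        # product ~0.52: an upper bound on the training error
```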

Margins & Boosting: We can see that the training error goes down, but what about the test error? That is, does boosting help us generalize better? To answer this question, we need to look at how confident we are in our predictions. How can we measure this? Margins!

Margins & Boosting: Intuition: larger margins lead to better generalization (same as SVMs). (Here the margin of a training example is the normalized weighted vote in favor of the correct label, y^(i) Σ_m α_m h_m(x^(i)) / Σ_m α_m.) Theorem: with high probability, boosting increases the size of the margins. Note: boosting does NOT maximize the margin, so it can still have poor generalization performance.

Boosting Performance

Boosting as Optimization: AdaBoost can actually be interpreted as a coordinate descent method for a specific loss function! Let h_1, …, h_T be the set of all weak learners. The exponential loss is

ℓ(α_1, …, α_T) = Σ_i exp(−y^(i) Σ_t α_t h_t(x^(i))),

which is convex in α_t. AdaBoost minimizes this exponential loss.
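
For reference, a direct transcription of this loss into NumPy (the matrix layout and names are mine): column t of H holds the ±1 predictions of weak learner h_t on the training points.

```python
# Exponential loss of a weighted combination of weak learners.
import numpy as np

def exp_loss(alpha, H, y):
    """H is (N, T) with H[i, t] = h_t(x_i) in {-1, +1}; y is (N,) in {-1, +1}."""
    margins = y * (H @ alpha)       # y_i * sum_t alpha_t * h_t(x_i)
    return np.exp(-margins).sum()
```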

Coordinate Descent: Minimize the loss with respect to a single component of α; let's pick α_t. Setting the derivative to zero:

dℓ/dα_t = −Σ_i y^(i) h_t(x^(i)) exp(−y^(i) Σ_{t'} α_{t'} h_{t'}(x^(i)))
        = −exp(−α_t) Σ_{i: h_t(x^(i)) = y^(i)} exp(−y^(i) Σ_{t' ≠ t} α_{t'} h_{t'}(x^(i)))
          + exp(α_t) Σ_{i: h_t(x^(i)) ≠ y^(i)} exp(−y^(i) Σ_{t' ≠ t} α_{t'} h_{t'}(x^(i)))
        = 0

Coordinate Descent: Solving for α_t gives

α_t = (1/2) ln [ Σ_{i: h_t(x^(i)) = y^(i)} exp(−y^(i) Σ_{t' ≠ t} α_{t'} h_{t'}(x^(i))) / Σ_{i: h_t(x^(i)) ≠ y^(i)} exp(−y^(i) Σ_{t' ≠ t} α_{t'} h_{t'}(x^(i))) ].

This is similar to the AdaBoost update! The only difference is that AdaBoost tells us in which order we should update the variables.

Coordinate Descent: Start with α = 0, so that r_i = exp(−y^(i) Σ_{t' ≠ t} α_{t'} h_{t'}(x^(i))) = 1 for every i. Choosing t to minimize

Σ_{i: h_t(x^(i)) ≠ y^(i)} r_i = N Σ_i w_i^(1) 1[h_t(x^(i)) ≠ y^(i)]

means picking the weak learner with the smallest weighted error. For this choice of t, minimizing the objective with respect to α_t gives

α_t = (1/2) ln [ Σ_i w_i^(1) 1[h_t(x^(i)) = y^(i)] / Σ_i w_i^(1) 1[h_t(x^(i)) ≠ y^(i)] ] = (1/2) ln((1 − ε_1)/ε_1).

Repeating this procedure with the new values of α yields AdaBoost.
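
As a sanity check of this closed form (everything below is synthetic and the names are mine), the round-1 α_t from (1/2) ln((1 − ε)/ε) matches a brute-force grid minimization of the exponential loss:

```python
# Compare the closed-form round-1 alpha with a grid search over the exponential loss.
import numpy as np

rng = np.random.default_rng(0)
N = 200
y = rng.choice([-1, 1], size=N)
h = np.where(rng.random(N) < 0.35, -y, y)   # a weak learner that errs on ~35% of points

w = np.full(N, 1.0 / N)                      # uniform round-1 weights
eps = np.sum(w * (h != y))
alpha_closed = 0.5 * np.log((1 - eps) / eps)

grid = np.linspace(-3.0, 3.0, 20001)
losses = np.exp(-np.outer(grid, y * h)).sum(axis=1)   # exponential loss at each candidate alpha
alpha_grid = grid[np.argmin(losses)]

print(alpha_closed, alpha_grid)   # agree up to the grid spacing
```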

AdaBoost as Optimization: One could derive an AdaBoost-style algorithm for other types of loss functions! Important to note: the exponential loss is convex, but it may have multiple global optima. In practice, AdaBoost can perform quite differently than other methods for minimizing this loss (e.g., gradient descent).

Boosting in Practice: Our description of the algorithm assumed that the set of possible hypotheses was given. In practice, the set of hypotheses can be built as the algorithm progresses. Example: build a new decision tree at each iteration for the data set in which the i-th example has weight w_i^(m); when computing the information gain, compute the empirical probabilities using the weights.
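
A sketch of what "empirical probabilities using the weights" looks like in code (the helper names are mine, not the lecture's); in practice the same effect is obtained by handing the round's weights to an off-the-shelf tree learner as sample weights.

```python
# Weighted empirical probabilities for information gain, plus the sample_weight shortcut.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_entropy(y, w):
    """Entropy of the labels y under the weight distribution w."""
    p = np.array([w[y == c].sum() for c in np.unique(y)]) / w.sum()
    return -(p * np.log2(p)).sum()

def weighted_info_gain(feature, y, w):
    """Information gain of a discrete feature, with all counts replaced by weights."""
    total = weighted_entropy(y, w)
    cond = sum(w[feature == v].sum() / w.sum()
               * weighted_entropy(y[feature == v], w[feature == v])
               for v in np.unique(feature))
    return total - cond

# Equivalent in practice: weight each example by w_i^(m) when fitting the tree, e.g.
# stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
```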

Boosting vs. Bagging: Bagging doesn't work so well with stable models; boosting might still help. Boosting might hurt performance on noisy datasets; bagging doesn't have this problem. On average, boosting helps more than bagging, but it is also more common for boosting to hurt performance. Bagging is easier to parallelize.

Other Approaches: Mixtures of Experts (see Bishop, Chapter 14), Cascading Classifiers, and many others.