
ICML 2009 Tutorial: Survey of Boosting from an Optimization Perspective. Part I: Entropy Regularized LPBoost. Part II: Boosting from an Optimization Perspective. Manfred K. Warmuth (UCSC) and S.V.N. Vishwanathan (Purdue & Microsoft Research). Updated: March 23, 2010. Warmuth (UCSC) ICML 09 Boosting Tutorial 1 / 62

Outline 1 Introduction to Boosting 2 What is Boosting? 3 Entropy Regularized LPBoost 4 Overview of Boosting algorithms 5 Conclusion and Open Problems Warmuth (UCSC) ICML 09 Boosting Tutorial 2 / 62

Outline Introduction to Boosting 1 Introduction to Boosting 2 What is Boosting? 3 Entropy Regularized LPBoost 4 Overview of Boosting algorithms 5 Conclusion and Open Problems Warmuth (UCSC) ICML 09 Boosting Tutorial 3 / 62

Introduction to Boosting: Setup for Boosting [Giants of the field: Schapire, Freund]
Examples: 11 apples, labeled +1 if artificial and -1 if natural. Goal: classification.
Warmuth (UCSC) ICML 09 Boosting Tutorial 4 / 62

Introduction to Boosting: Setup for Boosting
[Scatter plot of the +1/-1 labeled examples in the (feature 1, feature 2) plane; point size indicates the example weight d_n; the data are separable.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 5 / 62

Introduction to Boosting: Weak hypotheses
Weak hypotheses: decision stumps on the two features; one stump alone can't do it.
Goal: find a convex combination of weak hypotheses that classifies all examples.
[Scatter plot with candidate decision stumps over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 6 / 62

Introduction to Boosting: Boosting, 1st iteration
First hypothesis: error = 1/11, edge = 9/11. Low error = high edge, since edge = 1 - 2 * error.
[Scatter plot with the first decision stump drawn over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 7 / 62

Introduction to Boosting: Update after 1st
Misclassified examples get increased weights. After the update the edge of the hypothesis has decreased.
[Scatter plot of the re-weighted examples over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 8 / 62

Introduction to Boosting: Before 2nd iteration
Hard examples get high weight.
[Scatter plot of the re-weighted examples over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 9 / 62

Introduction to Boosting: Boosting, 2nd hypothesis
Pick hypotheses with high (weighted) edge.
[Scatter plot with the second decision stump over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 10 / 62

Introduction to Boosting: Update after 2nd
After the update, the edges of all past hypotheses should be small.
[Scatter plot of the re-weighted examples over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 11 / 62

Introduction to Boosting: 3rd hypothesis
[Scatter plot with the third decision stump over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 12 / 62

Introduction to Boosting: Update after 3rd
[Scatter plot of the re-weighted examples over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 13 / 62

Introduction to Boosting: 4th hypothesis
[Scatter plot with the fourth decision stump over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 14 / 62

Introduction to Boosting: Update after 4th
[Scatter plot of the re-weighted examples over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 15 / 62

Introduction to Boosting: Final convex combination of all hypotheses
Decision: is ∑_{t=1}^T w_t h_t(x) ≥ 0? (Positive total weight versus negative total weight.)
[Scatter plot showing the combined decision boundary over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 16 / 62

Introduction to Boosting: Protocol of Boosting [FS97]
Maintain a distribution on the N ±1-labeled examples.
At iteration t = 1, ..., T:
- receive a weak hypothesis h_t of high edge
- update d^{t-1} to d^t: more weight on hard examples
Output a convex combination of the weak hypotheses: ∑_{t=1}^T w_t h_t(x).
Two sets of weights: a distribution d on the examples and a distribution w on the hypotheses.
Warmuth (UCSC) ICML 09 Boosting Tutorial 17 / 62

Introduction to Boosting: Data representation
Examples x_n with labels y_n; each hypothesis h_t gives an accuracy vector u^t with entries u_n^t := y_n h_t(x_n): perfect +1, opposite -1, neutral 0.
[Table of the ±1 accuracy entries u_n^t for the example set.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 18 / 62

Introduction to Boosting: Edge vs. margin [Br99]
Edge of a hypothesis h_t for a distribution d on the examples: ∑_{n=1}^N u_n^t d_n (the weighted accuracy of the hypothesis; d ∈ P^N).
Margin of example n for the current hypothesis weighting w: ∑_{t=1}^T u_n^t w_t (the weighted accuracy of the example; w ∈ P^T).
Warmuth (UCSC) ICML 09 Boosting Tutorial 19 / 62
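To make the two quantities concrete, here is a small numpy sketch (not part of the tutorial slides): it builds a hypothetical accuracy matrix U with U[n, t] = u_n^t and computes each hypothesis edge as a d-weighted column sum and each example margin as a w-weighted row sum.

```python
import numpy as np

# Hypothetical accuracy matrix: U[n, t] = u_n^t = y_n * h_t(x_n), entries in {-1, 0, +1}.
U = np.array([[ 1, -1,  1],
              [-1,  1,  1],
              [ 1,  1, -1],
              [ 1, -1, -1]], dtype=float)
N, T = U.shape

d = np.full(N, 1.0 / N)   # distribution d on the examples
w = np.full(T, 1.0 / T)   # distribution w on the hypotheses

edges   = U.T @ d         # edge of hypothesis h_t:  sum_n u_n^t d_n
margins = U @ w           # margin of example n:     sum_t u_n^t w_t
print("edges:", edges, " margins:", margins)
```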


Introduction to Boosting: AdaBoost
Initialize t = 0 and d_n^0 = 1/N.
For t = 1, ..., T:
- get h_t whose edge w.r.t. the current distribution is 1 - 2ε_t
- set w_t = (1/2) ln((1 - ε_t)/ε_t)
- update the distribution: d_n^t = d_n^{t-1} exp(-w_t u_n^t) / ∑_n d_n^{t-1} exp(-w_t u_n^t)
Final hypothesis: sgn(∑_{t=1}^T w_t h_t(x)).
Warmuth (UCSC) ICML 09 Boosting Tutorial 20 / 62
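A minimal Python sketch of the update on this slide, assuming the weak hypotheses come from a fixed finite pool encoded by a ±1 accuracy matrix U; the argmax oracle and the toy data are illustrative assumptions, not part of the slide.

```python
import numpy as np

def adaboost(U, T):
    """AdaBoost on a precomputed accuracy matrix U[n, q] = y_n * h_q(x_n) in {-1, +1}.

    Each iteration picks the pool hypothesis with the largest edge (a stand-in for
    the oracle), sets w_t = 1/2 ln((1 - eps_t)/eps_t), and re-weights the examples
    by exp(-w_t * u_n^t)."""
    N, Q = U.shape
    d = np.full(N, 1.0 / N)              # d^0: uniform distribution on the examples
    w = np.zeros(Q)                       # accumulated hypothesis weights
    for _ in range(T):
        edges = U.T @ d                   # edge of every pool hypothesis
        t = int(np.argmax(edges))         # oracle: hypothesis with maximum edge
        eps = (1.0 - edges[t]) / 2.0      # weighted error, since edge = 1 - 2*error
        if eps <= 0 or eps >= 0.5:        # perfect or useless hypothesis: stop
            w[t] += 1.0 if eps <= 0 else 0.0
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w[t] += alpha
        d = d * np.exp(-alpha * U[:, t])  # increase weight of misclassified examples
        d /= d.sum()                      # renormalise to a distribution
    return w / max(w.sum(), 1e-12)        # convex combination; sgn(U @ w) is the decision

# Toy run on a small random +-1 accuracy matrix.
rng = np.random.default_rng(0)
U = np.sign(rng.standard_normal((11, 6)))
w = adaboost(U, T=10)
print("margins of final combination:", U @ w)
```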

Introduction to Boosting: Objectives
Edge: the edges of past hypotheses should be small after the update; minimize the maximum edge of the past hypotheses.
Margin: choose the convex combination of weak hypotheses that maximizes the minimum margin.
Which margin? SVM: 2-norm (weights on the examples). Boosting: 1-norm (weights on the base hypotheses).
Connection between the objectives?
[Scatter plot comparing the SVM and Boosting separators over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 21 / 62

Introduction to Boosting: Edge vs. margin
min max edge = max min margin:
min_{d ∈ S^N} max_{q=1,...,t-1} u^q · d  (edge of hypothesis q)
  = max_{w ∈ S^{t-1}} min_{n=1,...,N} ∑_{q=1}^{t-1} u_n^q w_q  (margin of example n)
by Linear Programming duality.
Warmuth (UCSC) ICML 09 Boosting Tutorial 22 / 62

Introduction to Boosting: Boosting as a zero-sum game [FS97]
Rock, Paper, Scissors game; gain matrix U:
              column player
              R (w_1)   P (w_2)   S (w_3)
row R (d_1)      0         1        -1
row P (d_2)     -1         0         1
row S (d_3)      1        -1         0
A single row is a pure strategy of the row player and d is a mixed strategy; a single column is a pure strategy of the column player and w is a mixed strategy.
The row player minimizes and the column player maximizes the payoff d^T U w = ∑_{i,j} d_i U_{i,j} w_j.
Warmuth (UCSC) ICML 09 Boosting Tutorial 23 / 62

Introduction to Boosting: Optimum strategy
Min-max theorem (e_j denotes a pure strategy):
min_d max_w d^T U w = min_d max_j d^T U e_j = max_w min_d d^T U w = max_w min_i e_i^T U w = value of the game (0 in the example).
Optimal mixed strategies in the example: d = w = (.33, .33, .33).
Warmuth (UCSC) ICML 09 Boosting Tutorial 24 / 62
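The min-max value and an optimal mixed strategy can be recovered with an off-the-shelf LP solver. The sketch below (an illustration, not from the tutorial) feeds the Rock-Paper-Scissors gain matrix to scipy.optimize.linprog and should return the uniform strategy and value 0 from the slide.

```python
import numpy as np
from scipy.optimize import linprog

# Rock-Paper-Scissors gain matrix from the slide: the row player (d) minimizes
# d^T U w, the column player (w) maximizes it.
U = np.array([[ 0,  1, -1],
              [-1,  0,  1],
              [ 1, -1,  0]], dtype=float)
N, T = U.shape

# Row player's LP:  min v  s.t.  (U^T d)_j <= v for every column j,  sum(d) = 1,  d >= 0.
# Variables are x = (d_1, ..., d_N, v).
c = np.r_[np.zeros(N), 1.0]
A_ub = np.c_[U.T, -np.ones(T)]          # U^T d - v <= 0, one row per column of U
b_ub = np.zeros(T)
A_eq = np.r_[np.ones(N), 0.0][None, :]  # d sums to one, v unconstrained in the equality
b_eq = np.array([1.0])
bounds = [(0, None)] * N + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
d_opt, value = res.x[:N], res.x[N]
print("optimal d:", np.round(d_opt, 3), " value of the game:", round(value, 3))
# Expected: d = (1/3, 1/3, 1/3) and value 0, matching the slide.
```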

Introduction to Boosting: Connection to Boosting?
Rows are the examples; columns u^q encode the weak hypotheses h_q.
Row sum (weighted by w): margin of an example. Column sum (weighted by d): edge of a weak hypothesis.
Value of the game: min max edge = max min margin (von Neumann's Minimax Theorem).
Warmuth (UCSC) ICML 09 Boosting Tutorial 25 / 62

Introduction to Boosting: Edges/margins
              R      P      S     margin
w:           .33    .33    .33
d_1 = .33      0      1     -1       0
d_2 = .33     -1      0      1       0
d_3 = .33      1     -1      0       0
edge           0      0      0
min margin = 0, max edge = 0, value of the game = 0.
Warmuth (UCSC) ICML 09 Boosting Tutorial 26 / 62

Introduction to Boosting: New column added: boosting
              R      P      S    new     margin
w:           .44      0    .22    .33
d_1 = .22      0      1     -1      1      .11
d_2 = .33     -1      0      1      1      .11
d_3 = .44      1     -1      0     -1      .11
edge         .11   -.22    .11    .11
Value of the game increases from 0 to .11.
Warmuth (UCSC) ICML 09 Boosting Tutorial 27 / 62

Introduction to Boosting: Row added: on-line learning
              R      P      S     margin
w:           .33    .44    .22
d_1 = 0        0      1     -1      .22
d_2 = .22     -1      0      1     -.11
d_3 = .44      1     -1      0     -.11
d_4 = .33     -1      1     -1     -.11
edge         -.11   -.11   -.11
Value of the game decreases from 0 to -.11.
Warmuth (UCSC) ICML 09 Boosting Tutorial 28 / 62

Introduction to Boosting: Boosting maximizes the margin incrementally
Iteration 1: one column (w_1^1); rows d_1^1, d_2^1, d_3^1 with entries (0, 1, -1).
Iteration 2: columns (w_1^2, w_2^2); rows with entries (0, -1), (1, 0), (-1, 1).
Iteration 3: columns (w_1^3, w_2^3, w_3^3); rows with entries (0, -1, 1), (1, 0, -1), (-1, 1, 0).
In each iteration solve an optimization problem to update d; the column player / oracle provides a new hypothesis.
Boosting is a column generation method in the d domain and coordinate descent in the w domain.
Warmuth (UCSC) ICML 09 Boosting Tutorial 29 / 62

Outline What is Boosting? 1 Introduction to Boosting 2 What is Boosting? 3 Entropy Regularized LPBoost 4 Overview of Boosting algorithms 5 Conclusion and Open Problems Warmuth (UCSC) ICML 09 Boosting Tutorial 30 / 62

What is Boosting?
Boosting = a greedy method for increasing the margin.
Converges to the optimum margin w.r.t. all hypotheses.
Want a small number of iterations.
Warmuth (UCSC) ICML 09 Boosting Tutorial 31 / 62

What is Boosting? Assumption on the next weak hypothesis: for the current weighting of the examples, the oracle returns a hypothesis of edge g.
Goal: for a given ε, produce a convex combination of weak hypotheses with soft margin ≥ g - ε.
Number of iterations: O(log N / ε²).
Warmuth (UCSC) ICML 09 Boosting Tutorial 32 / 62

What is Boosting? Recall the min-max theorem:
min_{d ∈ S^N} max_{q=1,...,t} u^q · d  (edge of hypothesis q)
  = max_{w ∈ S^t} min_{n=1,...,N} ∑_{q=1}^t u_n^q w_q  (margin of example n)
Warmuth (UCSC) ICML 09 Boosting Tutorial 33 / 62

What is Boosting? Visualizing the margin.
[Plot of the examples and the margin of the current convex combination over feature 1 and feature 2.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 34 / 62

What is Boosting? Min-max theorem, inseparable case: slack variables in the w domain = capping in the d domain.
min_{d ∈ S^N, d ≤ (1/ν) 1} max_{q=1,...,t} u^q · d  (edge of hypothesis q)
  = max_{w ∈ S^t, ψ ≥ 0} [ min_{n=1,...,N} ( ∑_{q=1}^t u_n^q w_q + ψ_n )  (soft margin of example n)  - (1/ν) ∑_{n=1}^N ψ_n ]
Warmuth (UCSC) ICML 09 Boosting Tutorial 35 / 62

What is Boosting? Visualizing the soft margin.
[Plot in the (hypothesis 1, hypothesis 2) plane showing an example's slack ψ relative to the margin.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 36 / 62

What is Boosting? LPBoost
Choose the distribution that minimizes the maximum edge of the current hypotheses by solving
P_LP^t = min_{∑_n d_n = 1, d ≤ (1/ν) 1} max_{q=1,...,t} u^q · d.
All the weight is put on examples with minimum soft margin.
[Plot of the LP objective value as a function of d_1.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 37 / 62
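A hedged sketch of this optimization as a linear program, again via scipy.optimize.linprog; the toy accuracy matrix and the capping parameter ν = 3 are made-up inputs for illustration, not from the tutorial.

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_distribution(U_t, nu):
    """One LPBoost d-step (a sketch): minimize the maximum edge of the hypotheses
    seen so far over the capped simplex {d : sum(d) = 1, 0 <= d_n <= 1/nu}.

    U_t[n, q] = u_n^q is the accuracy matrix of the t hypotheses chosen so far.
    Variables are x = (d_1, ..., d_N, gamma); we minimize gamma subject to
    u^q . d <= gamma for every q."""
    N, t = U_t.shape
    c = np.r_[np.zeros(N), 1.0]
    A_ub = np.c_[U_t.T, -np.ones(t)]           # u^q . d - gamma <= 0
    b_ub = np.zeros(t)
    A_eq = np.r_[np.ones(N), 0.0][None, :]     # sum_n d_n = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0 / nu)] * N + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:N], res.x[N]                  # new distribution d and the value P_LP^t

# Toy example: 6 examples, 2 hypotheses, capping parameter nu = 3 (so d_n <= 1/3).
rng = np.random.default_rng(1)
U_t = np.sign(rng.standard_normal((6, 2)))
d, p_lp = lpboost_distribution(U_t, nu=3)
print("d:", np.round(d, 3), " min-max edge:", round(p_lp, 3))
```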

Outline Entropy Regularized LPBoost 1 Introduction to Boosting 2 What is Boosting? 3 Entropy Regularized LPBoost 4 Overview of Boosting algorithms 5 Conclusion and Open Problems Warmuth (UCSC) ICML 09 Boosting Tutorial 38 / 62

Entropy Regularized LPBoost
min_{∑_n d_n = 1, d ≤ (1/ν) 1} [ max_{q=1,...,t} u^q · d + (1/η) Δ(d, d^0) ]
d_n = exp(-η · soft margin of example n) / Z, a "soft min".
This form of the weights appeared first in the ν-Arc algorithm [RSS+00].
Regularization in the d domain makes the problem strongly convex; the gradient of the dual is Lipschitz continuous in w [e.g. HL93, RW97].
Warmuth (UCSC) ICML 09 Boosting Tutorial 39 / 62
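A small numpy sketch of the resulting "soft min" weighting, ignoring the capping constraint d ≤ (1/ν)1 for simplicity; the toy matrix, hypothesis weights, and η values are illustrative assumptions. As η grows, the weight concentrates on the minimum-margin examples (LPBoost-like), while moderate η keeps the distribution spread out.

```python
import numpy as np

def softmin_distribution(U_t, w, d0, eta):
    """Entropy-regularized weighting (a sketch, without the capping d <= 1/nu):
    d_n is proportional to d0_n * exp(-eta * margin_n), a 'soft min' over examples."""
    margins = U_t @ w                   # margin of each example under weighting w
    logits = np.log(d0) - eta * margins
    logits -= logits.max()              # subtract the max for numerical stability
    d = np.exp(logits)
    return d / d.sum()

U_t = np.array([[ 1.0, -1.0],
                [ 1.0,  1.0],
                [-1.0,  1.0],
                [ 1.0,  1.0]])
w = np.array([0.5, 0.5])
d0 = np.full(4, 0.25)
for eta in (1.0, 10.0, 100.0):
    print("eta =", eta, "->", np.round(softmin_distribution(U_t, w, d0, eta), 3))
```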

Entropy Regularized LPBoost: The effect of entropy regularization
Different distributions on the examples.
[Two scatter plots over feature 1 and feature 2: LPBoost puts lots of zeros on the weights (brittle); ERLPBoost is smoother.]
Warmuth (UCSC) ICML 09 Boosting Tutorial 40 / 62

Outline Overview of Boosting algorithms 1 Introduction to Boosting 2 What is Boosting? 3 Entropy Regularized LPBoost 4 Overview of Boosting algorithms 5 Conclusion and Open Problems Warmuth (UCSC) ICML 09 Boosting Tutorial 41 / 62

Overview of Boosting algorithms: AdaBoost [FS97]
d_n^t := d_n^{t-1} exp(-w_t u_n^t) / ∑_n d_n^{t-1} exp(-w_t u_n^t),
where w_t is chosen s.t. ∑_n d_n^{t-1} exp(-w u_n^t) is minimized, i.e.
∂/∂w ∑_n d_n^{t-1} exp(-w u_n^t) |_{w=w_t} = 0, equivalently ∑_n u_n^t d_n^{t-1} exp(-w_t u_n^t) / ∑_n d_n^{t-1} exp(-w_t u_n^t) = u^t · d^t = 0.
Easy to implement. Adjusts the distribution so that the edge of the last hypothesis is zero.
Gets within half of the optimal hard margin [RSD07], but only in the limit.
Warmuth (UCSC) ICML 09 Boosting Tutorial 42 / 62
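For ±1-valued u^t the minimizing w_t has the closed form from slide 20; for real-valued accuracies one can instead drive the post-update edge u^t · d^t to zero numerically. The following sketch (an illustration under stated assumptions, not the tutorial's code) does this with scipy's brentq, assuming the hypothesis has positive edge under d^{t-1}.

```python
import numpy as np
from scipy.optimize import brentq

def adaboost_step_weight(u, d_prev):
    """Find w_t with zero post-update edge:  sum_n u_n d^{t-1}_n exp(-w u_n) = 0.
    Assumes u has entries in [-1, 1] and positive edge under d_prev; for +-1-valued
    u this reduces to the closed form w_t = 1/2 ln((1 - eps)/eps)."""
    def post_edge(w):
        return float(np.dot(u * d_prev, np.exp(-w * u)))
    hi = 1.0
    while post_edge(hi) > 0:      # expand the bracket until the edge turns negative
        hi *= 2.0
        if hi > 1e6:
            raise ValueError("hypothesis never misclassifies: edge cannot be driven to 0")
    return brentq(post_edge, 0.0, hi)

# +-1 example: 1 mistake out of 11 under uniform weights, so eps = 1/11.
u = np.array([1.0] * 10 + [-1.0])
d = np.full(11, 1.0 / 11)
w_t = adaboost_step_weight(u, d)
print(w_t, 0.5 * np.log((1 - 1/11) / (1/11)))   # the two values should agree

d_new = d * np.exp(-w_t * u); d_new /= d_new.sum()
print("edge after update:", np.dot(u, d_new))    # approximately 0
```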

Overview of Boosting algorithms: Corrective versus totally corrective
Processing the last hypothesis versus all past hypotheses.
Corrective: AdaBoost, LogitBoost, AdaBoost*, [SS, COLT 08].
Totally corrective: LPBoost, TotalBoost, SoftBoost, ERLPBoost.
Warmuth (UCSC) ICML 09 Boosting Tutorial 43 / 62

Overview of Boosting algorithms: From AdaBoost to ERLPBoost
AdaBoost (as interpreted in [KW99, La99])
  Primal: min_d Δ(d, d^{t-1}) s.t. d · u^t = 0, d · 1 = 1
  Dual: max_w -ln ∑_n d_n^{t-1} exp(-η u_n^t w_t) s.t. w ≥ 0
  Achieves half of the optimum hard margin in the limit.
AdaBoost* [RW05]
  Primal: min_d Δ(d, d^{t-1}) s.t. d · u^t ≤ γ_t, d · 1 = 1
  Dual: max_w -ln ∑_n d_n^{t-1} exp(-η u_n^t w_t) - γ_t ‖w‖_1 s.t. w ≥ 0
  where the edge bound γ_t is adjusted downward by a heuristic.
  Good iteration bound for reaching the optimum hard margin.
Warmuth (UCSC) ICML 09 Boosting Tutorial 44 / 62

Overview of Boosting algorithms: SoftBoost and ERLPBoost
SoftBoost [WGR07]
  Primal: min_d Δ(d, d^0) s.t. d · 1 = 1, d ≤ (1/ν) 1, d · u^q ≤ γ_t for 1 ≤ q ≤ t
  Dual: min_{w,ψ} ln ∑_n d_n^0 exp(-η ∑_{q=1}^t u_n^q w_q - η ψ_n) + (1/ν) ‖ψ‖_1 + γ_t ‖w‖_1 s.t. w ≥ 0, ψ ≥ 0
  where the edge bound γ_t is adjusted downward by a heuristic. Good iteration bound for reaching the soft margin.
ERLPBoost [WGV08]
  Primal: min_{d,γ} γ + (1/η) Δ(d, d^0) s.t. d · 1 = 1, d ≤ (1/ν) 1, d · u^q ≤ γ for 1 ≤ q ≤ t
  Dual: min_{w,ψ} (1/η) ln ∑_n d_n^0 exp(-η ∑_{q=1}^t u_n^q w_q - η ψ_n) + (1/ν) ‖ψ‖_1 s.t. w ≥ 0, ‖w‖_1 = 1, ψ ≥ 0
  where for the iteration bound η is fixed to max((2/ε) ln(N/ν), 1/2). Good iteration bound for reaching the soft margin.
Warmuth (UCSC) ICML 09 Boosting Tutorial 45 / 62

Overview of Boosting algorithms: Corrective ERLPBoost [SS08]
  Primal: min_d ∑_{q=1}^t w_q (u^q · d) + (1/η) Δ(d, d^0) s.t. d · 1 = 1, d ≤ (1/ν) 1
  Dual: min_ψ (1/η) ln ∑_n d_n^0 exp(-η ∑_{q=1}^t u_n^q w_q - η ψ_n) + (1/ν) ‖ψ‖_1 s.t. ψ ≥ 0
  where for the iteration bound η is fixed to max((2/ε) ln(N/ν), 1/2).
  Good iteration bound for reaching the soft margin.
Warmuth (UCSC) ICML 09 Boosting Tutorial 46 / 62

Overview of Boosting algorithms: Iteration bounds
Corrective: AdaBoost, LogitBoost, AdaBoost*, [SS, COLT 08]. Totally corrective: LPBoost, TotalBoost, SoftBoost, ERLPBoost.
Strong oracle: returns the hypothesis with maximum edge. Weak oracle: returns a hypothesis with edge g.
In O(log(N/ν) / ε²) iterations: within ε of the maximum soft margin for the strong oracle, or within ε of g for the weak oracle. Ditto for the hard margin case.
In O(log N / g²) iterations: consistency with the weak oracle.
Warmuth (UCSC) ICML 09 Boosting Tutorial 47 / 62

Overview of Boosting algorithms: LPBoost may require Ω(N) iterations
             w_1     w_2     w_3     w_4     w_5    margin
w:             0       0       0       0       0
d_1 = .125    +1    -.95    -.93    -.91    -.99
d_2 = .125    +1    -.95    -.93    -.91    -.99
d_3 = .125    +1    -.95    -.93    -.91    -.99
d_4 = .125    +1    -.95    -.93    -.91    -.99
d_5 = .125   -.98     +1    -.93    -.91    +.99
d_6 = .125   -.97    -.96     +1    -.91    +.99
d_7 = .125   -.97    -.95    -.94     +1    +.99
d_8 = .125   -.97    -.95    -.93    -.92    +.99
edge        .0137  -.7075  -.6900  -.6725   .0000
value         -1
Warmuth (UCSC) ICML 09 Boosting Tutorial 48 / 62

Overview of Boosting algorithms: LPBoost may require Ω(N) iterations
             w_1     w_2     w_3     w_4     w_5    margin
w:             1       0       0       0       0
d_1 = 0       +1    -.95    -.93    -.91    -.99       1
d_2 = 0       +1    -.95    -.93    -.91    -.99       1
d_3 = 0       +1    -.95    -.93    -.91    -.99       1
d_4 = 0       +1    -.95    -.93    -.91    -.99       1
d_5 = 1      -.98     +1    -.93    -.91    +.99     -.98
d_6 = 0      -.97    -.96     +1    -.91    +.99     -.97
d_7 = 0      -.97    -.95    -.94     +1    +.99     -.97
d_8 = 0      -.97    -.95    -.93    -.92    +.99     -.97
edge         -.98      1    -.93    -.91     .99
value          -1    -.98
Warmuth (UCSC) ICML 09 Boosting Tutorial 49 / 62

Overview of Boosting algorithms: LPBoost may require Ω(N) iterations
             w_1     w_2     w_3     w_4     w_5    margin
w:             0       1       0       0       0
d_1 = 0       +1    -.95    -.93    -.91    -.99     -.95
d_2 = 0       +1    -.95    -.93    -.91    -.99     -.95
d_3 = 0       +1    -.95    -.93    -.91    -.99     -.95
d_4 = 0       +1    -.95    -.93    -.91    -.99     -.95
d_5 = 0      -.98     +1    -.93    -.91    +.99       1
d_6 = 1      -.97    -.96     +1    -.91    +.99     -.96
d_7 = 0      -.97    -.95    -.94     +1    +.99     -.95
d_8 = 0      -.97    -.95    -.93    -.92    +.99     -.95
edge         -.97    -.96      1    -.91     .99
value          -1    -.98    -.96
Warmuth (UCSC) ICML 09 Boosting Tutorial 50 / 62

Overview of Boosting algorithms: LPBoost may require Ω(N) iterations
             w_1     w_2     w_3     w_4     w_5    margin
w:             0       0       1       0       0
d_1 = 0       +1    -.95    -.93    -.91    -.99     -.93
d_2 = 0       +1    -.95    -.93    -.91    -.99     -.93
d_3 = 0       +1    -.95    -.93    -.91    -.99     -.93
d_4 = 0       +1    -.95    -.93    -.91    -.99     -.93
d_5 = 0      -.98     +1    -.93    -.91    +.99     -.93
d_6 = 0      -.97    -.96     +1    -.91    +.99       1
d_7 = 1      -.97    -.95    -.94     +1    +.99     -.94
d_8 = 0      -.97    -.95    -.93    -.92    +.99     -.93
edge         -.97    -.95    -.94      1     .99
value          -1    -.98    -.96    -.94
Warmuth (UCSC) ICML 09 Boosting Tutorial 51 / 62

Overview of Boosting algorithms: LPBoost may require Ω(N) iterations
             w_1     w_2     w_3     w_4     w_5    margin
w:             0       0       0       1       0
d_1 = 0       +1    -.95    -.93    -.91    -.99     -.91
d_2 = 0       +1    -.95    -.93    -.91    -.99     -.91
d_3 = 0       +1    -.95    -.93    -.91    -.99     -.91
d_4 = 0       +1    -.95    -.93    -.91    -.99     -.91
d_5 = 0      -.98     +1    -.93    -.91    +.99     -.91
d_6 = 0      -.97    -.96     +1    -.91    +.99     -.91
d_7 = 0      -.97    -.95    -.94     +1    +.99       1
d_8 = 1      -.97    -.95    -.93    -.92    +.99     -.92
edge         -.97    -.95    -.94    -.92     .99
value          -1    -.98    -.96    -.94    -.92
Warmuth (UCSC) ICML 09 Boosting Tutorial 52 / 62

Overview of Boosting algorithms: LPBoost may require Ω(N) iterations
             w_1     w_2     w_3     w_4     w_5    margin
w:            .5    .0026     0       0    .4975
d_1 = .497    +1    -.95    -.93    -.91    -.99    .0051
d_2 = 0       +1    -.95    -.93    -.91    -.99    .0051
d_3 = 0       +1    -.95    -.93    -.91    -.99    .0051
d_4 = 0       +1    -.95    -.93    -.91    -.99    .0051
d_5 = 0      -.98     +1    -.93    -.91    +.99    .0051
d_6 = .490   -.97    -.96     +1    -.91    +.99    .0051
d_7 = 0      -.97    -.95    -.94     +1    +.99    .0051
d_8 = .013   -.97    -.95    -.93    -.92    +.99    .0051
edge        .0051   .0051   .9055   .9100   .0051
value          -1    -.98    -.96    -.94    -.92    .0051
No ties!
Warmuth (UCSC) ICML 09 Boosting Tutorial 53 / 62

Overview of Boosting algorithms: LPBoost may return a bad final hypothesis
How good is the master hypothesis returned by LPBoost compared to the best possible convex combination of hypotheses?
Any linearly separable dataset can be reduced to a dataset on which LPBoost misclassifies all examples by adding a bad example and adding a bad hypothesis.
Warmuth (UCSC) ICML 09 Boosting Tutorial 54 / 62

Overview of Boosting algorithms: Adding a bad example
             w_1     w_2     w_3     w_4     w_5    margin
w:            .5    .0026     0       0    .4975
d_1 = 0       +1    -.95    -.93    -.91    -.99    .0051
d_2 = 0       +1    -.95    -.93    -.91    -.99    .0051
d_3 = 0       +1    -.95    -.93    -.91    -.99    .0051
d_4 = 0       +1    -.95    -.93    -.91    -.99    .0051
d_5 = 0      -.98     +1    -.93    -.91    +.99    .0051
d_6 = 0      -.97    -.96     +1    -.91    +.99    .0051
d_7 = 0      -.97    -.95    -.94     +1    +.99    .0051
d_8 = 0      -.97    -.95    -.93    -.92    +.99    .0051
d_9 = 1      -.03    -.03    -.03    -.03    -.03     .03
edge          .03     .03     .03     .03     .03
value          -1    -.98    -.96    -.94    -.92     .03
Warmuth (UCSC) ICML 09 Boosting Tutorial 55 / 62

Overview of Boosting algorithms: Adding a bad hypothesis
             w_1     w_2     w_3     w_4     w_5     w_6    margin
w:             0       0       0       0       0       1
d_1 = 0       +1    -.95    -.93    -.91    -.99    -.01    .0051
d_2 = 0       +1    -.95    -.93    -.91    -.99    -.01    .0051
d_3 = 0       +1    -.95    -.93    -.91    -.99    -.01    .0051
d_4 = 0       +1    -.95    -.93    -.91    -.99    -.01    .0051
d_5 = 0      -.98     +1    -.93    -.91    +.99    -.01    .0051
d_6 = 0      -.97    -.96     +1    -.91    +.99    -.01    .0051
d_7 = 0      -.97    -.95    -.94     +1    +.99    -.01    .0051
d_8 = 0      -.97    -.95    -.93    -.92    +.99    -.01    .0051
d_9 = 1      -.03    -.03    -.03    -.03    -.03    -.02    .0051
edge          .03     .03     .03     .03     .03     .02
value          -1    -.98    -.96    -.94    -.92    -.03
Warmuth (UCSC) ICML 09 Boosting Tutorial 56 / 62

Overview of Boosting algorithms: Adding a bad hypothesis
             w_1     w_2     w_3     w_4     w_5     w_6    margin
w:             0       0       0       0       0       1
d_1 = 0       +1    -.95    -.93    -.91    -.99    -.01     .01
d_2 = 0       +1    -.95    -.93    -.91    -.99    -.01     .01
d_3 = 0       +1    -.95    -.93    -.91    -.99    -.01     .01
d_4 = 0       +1    -.95    -.93    -.91    -.99    -.01     .01
d_5 = 0      -.98     +1    -.93    -.91    +.99    -.01     .01
d_6 = 0      -.97    -.96     +1    -.91    +.99    -.01     .01
d_7 = 0      -.97    -.95    -.94     +1    +.99    -.01     .01
d_8 = 0      -.97    -.95    -.93    -.92    +.99    -.01     .01
d_9 = 1      -.03    -.03    -.03    -.03    -.03    -.02     .02
edge          .03     .03     .03     .03     .03     .02
value          -1    -.98    -.96    -.94    -.92    -.03    -.02
Warmuth (UCSC) ICML 09 Boosting Tutorial 56 / 62


Overview of Boosting algorithms: Adding a bad hypothesis
             w_1     w_2     w_3     w_4     w_5     w_6    margin
w:            .5       0       0       0      .5       0
d_1 = 0       +1    -.95    -.93    -.91    -.99    -.01    +.005
d_2 = 0       +1    -.95    -.93    -.91    -.99    -.01    +.005
d_3 = 0       +1    -.95    -.93    -.91    -.99    -.01    +.005
d_4 = 0       +1    -.95    -.93    -.91    -.99    -.01    +.005
d_5 = 0      -.98     +1    -.93    -.91    +.99    -.01    +.005
d_6 = 0      -.97    -.96     +1    -.91    +.99    -.01    +.01
d_7 = 0      -.97    -.95    -.94     +1    +.99    -.01    +.01
d_8 = 0      -.97    -.95    -.93    -.92    +.99    -.01    +.01
d_9 = 1      -.03    -.03    -.03    -.03    -.03    -.02     .03
Warmuth (UCSC) ICML 09 Boosting Tutorial 56 / 62

Overview of Boosting algorithms: Synopsis
LPBoost is often unstable; for safety, add relative entropy regularization.
Corrective algorithms: sometimes easy to code; fast per iteration.
Totally corrective algorithms: smaller number of iterations; faster overall time when ε is small.
Weak versus strong oracle makes a big difference in practice.
Warmuth (UCSC) ICML 09 Boosting Tutorial 57 / 62

Overview of Boosting algorithms: O(log N / ε²) iteration bounds
Good: the bound is a major design tool; any reasonable boosting algorithm should have this bound.
Bad: the bound is weak. Comparing (ln N)/ε² with N: for ε = .01 it only drops below N once N ≈ 1.2 · 10^5, and for ε = .001 once N ≈ 1.7 · 10^7.
Why are totally corrective algorithms much better in practice?
Warmuth (UCSC) ICML 09 Boosting Tutorial 58 / 62
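A quick numerical check of the crossover figures above (an illustration, not part of the slides): solve N = (ln N)/ε² by fixed-point iteration.

```python
import numpy as np

# Where does the bound (ln N)/eps^2 stop exceeding the trivial bound N?
# Solve N = ln(N)/eps^2 by fixed-point iteration; the map is a contraction near
# the fixed point, so a few dozen iterations are plenty.
for eps in (0.01, 0.001):
    N = 10.0
    for _ in range(100):
        N = np.log(N) / eps**2
    print(f"eps = {eps}: crossover at N ~ {N:.3g}")
# Gives roughly 1.2e5 for eps = .01 and 1.7e7 for eps = .001, close to the slide's figures.
```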

Overview of Boosting algorithms: Lower bounds on the number of iterations
A majority of Ω(log N / g²) hypotheses is needed for achieving consistency with a weak oracle of guarantee g [Fr95].
Easy: Ω(1/ε²) iteration bound for getting within ε of the hard margin with a strong oracle.
Harder: Ω(log N / ε²) iteration bound for the strong oracle [Ne83?].
Warmuth (UCSC) ICML 09 Boosting Tutorial 59 / 62

Outline Conclusion and Open Problems 1 Introduction to Boosting 2 What is Boosting? 3 Entropy Regularized LPBoost 4 Overview of Boosting algorithms 5 Conclusion and Open Problems Warmuth (UCSC) ICML 09 Boosting Tutorial 60 / 62

Conclusion and Open Problems: Conclusion
Adding relative entropy regularization to LPBoost leads to a good boosting algorithm.
Boosting is an instantiation of the MaxEnt and MinxEnt principles [Jaynes 57, Kullback 59].
Relative entropy regularization smoothes the one-norm regularization.
Open problems:
- When the hypotheses have one-sided error, O(log N / ε) iterations suffice [As00, HW03]. Does ERLPBoost have an O(log N / ε) bound when the hypotheses are one-sided?
- Replace geometric optimizers by entropic ones.
- Compare ours with Freund's algorithms that don't just cap, but forget examples.
Warmuth (UCSC) ICML 09 Boosting Tutorial 61 / 62

Conclusion and Open Problems: Acknowledgment
Rob Schapire and Yoav Freund for pioneering Boosting; Gunnar Rätsch for bringing in optimization; Karen Glocer for helping with figures and plots.
Warmuth (UCSC) ICML 09 Boosting Tutorial 62 / 62