Totally Corrective Boosting Algorithms that Maximize the Margin


Totally Corrective Boosting Algorithms that Maximize the Margin
Manfred K. Warmuth (University of California, Santa Cruz), Jun Liao (University of California, Santa Cruz), Gunnar Rätsch (Friedrich Miescher Laboratory, Tübingen, Germany)
Last update: March 2, 2007

Outline
1 Basics
2 Large margins
3 Games
4 Convergence
5 Illustrative Experiments

Part 1: Basics

Protocol of Boosting
Maintain a distribution $d^t$ on the examples. At iteration $t = 1, \ldots, T$:
1 Receive a weak hypothesis $h_t$
2 Update $d^t$ to $d^{t+1}$, putting more weight on hard examples
Output a convex combination of the weak hypotheses: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$
[Freund & Schapire, 1995]
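A minimal sketch of this protocol in Python, assuming decision stumps as the weak learner; the exponential reweighting used here for illustration is the corrective AdaBoost-style update discussed a few slides later, and all function and variable names are illustrative.

```python
import numpy as np

def best_stump(X, y, d):
    """Return the decision stump (feature, threshold, sign) with the largest
    weighted edge gamma = sum_n d_n * y_n * h(x_n)."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > thr, 1.0, -1.0)
                edge = np.dot(d, y * pred)
                if best is None or edge > best[0]:
                    best = (edge, j, thr, s)
    return best  # (edge, feature, threshold, sign)

def boost(X, y, T=20):
    """Generic boosting protocol: maintain d^t, receive a weak hypothesis h_t,
    reweight hard examples, output a weighted combination of the hypotheses."""
    N = len(y)
    d = np.full(N, 1.0 / N)                 # d^1 = uniform distribution
    stumps, alphas = [], []
    for t in range(T):
        edge, j, thr, s = best_stump(X, y, d)
        edge = min(edge, 1.0 - 1e-12)       # guard against a perfect hypothesis
        alpha = 0.5 * np.log((1 + edge) / (1 - edge))
        pred = s * np.where(X[:, j] > thr, 1.0, -1.0)
        d = d * np.exp(-alpha * y * pred)   # more weight on misclassified examples
        d = d / d.sum()
        stumps.append((j, thr, s)); alphas.append(alpha)
    def f(Xnew):
        total = np.zeros(Xnew.shape[0])
        for (j, thr, s), a in zip(stumps, alphas):
            total += a * s * np.where(Xnew[:, j] > thr, 1.0, -1.0)
        return np.sign(total)
    return f
```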

Boosting: 1st Iteration
[Figure: the first weak hypothesis on the example set; error rate 2/11, edge 9/22.]
Error rate: $\epsilon_t = \sum_{n=1}^N d_n^t\, I(h_t(x_n) \neq y_n)$
Edge: $\gamma_t = \sum_{n=1}^N d_n^t\, y_n h_t(x_n) = 1 - 2\epsilon_t$

Update Distribution
Misclassified examples receive increased weights. After the update: error rate $\epsilon(h_t, d^{t+1}) = \frac{1}{2}$ and edge $\gamma(h_t, d^{t+1}) = 0$.

AdaBoost as Entropy Projection
Minimize the relative entropy to the last distribution subject to a constraint:
$\min_{d \in P^N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) = 0$,
where $\Delta(d, d^t) = \sum_{n=1}^N d_n \ln \frac{d_n}{d_n^t}$ and $P^N$ is the $N$-dimensional probability simplex.
[Lafferty, 1999; Kivinen & Warmuth, 1999]
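A small sketch of this projection, assuming the constrained minimizer has the exponential form $d_n^{t+1} \propto d_n^t e^{-\alpha\, y_n h_t(x_n)}$ with $\alpha$ chosen so that the edge constraint holds; the bisection search below is only one convenient way to solve for $\alpha$.

```python
import numpy as np

def entropy_project_single(d, u, tol=1e-12):
    """Project d onto {d' in simplex : d'.u = 0} in relative entropy,
    where u_n = y_n * h_t(x_n).  The minimizer is d'_n proportional to
    d_n * exp(-alpha * u_n); find alpha so the reweighted edge is 0."""
    def edge(alpha):
        w = d * np.exp(-alpha * u)
        w /= w.sum()
        return np.dot(w, u)
    lo, hi = -50.0, 50.0          # bracket for alpha (the edge is decreasing in alpha)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if edge(mid) > 0:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    w = d * np.exp(-alpha * u)
    return w / w.sum(), alpha
```

For ±1-valued hypotheses the closed form $\alpha = \tfrac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t}$ solves the same constraint, which is why AdaBoost can be read as this corrective entropy projection.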

Before 2nd Iteration
[Figure: the reweighted example set before the second hypothesis is received.]

Boosting: 2nd Hypothesis
AdaBoost assumption: edge $\gamma > \nu$.

Update Distribution
Edge $\gamma = 0$: the AdaBoost update sets the edge of the last hypothesis to 0.
Number of iterations: $\frac{2 \ln N}{\nu^2}$.

Which constraints?
Corrective (AdaBoost): a single constraint for the last hypothesis
$\min_{d \in P^N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) = 0$
Totally corrective: one constraint per past weak hypothesis
$\min_{d \in P^N} \Delta(d, d^1)$ s.t. $\sum_{n=1}^N d_n y_n h_q(x_n) = 0$ for $q = 1, \ldots, t$
[Lafferty, 1999; Kivinen & Warmuth, 1999]

Boosting: 3rd Hypothesis
[Figure]

Boosting: 4th Hypothesis
[Figure]

All Hypotheses
[Figure]

Decision: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x) > 0$?

Part 2: Large margins

Large margins in addition to correct classification
Margin of the combined hypothesis $f_\alpha$ for example $(x_n, y_n)$: $\rho_n(\alpha) = y_n f_\alpha(x_n) = y_n \sum_{t=1}^T \alpha_t h_t(x_n)$, with $\alpha \in P^T$.
The margin of a set of examples is the minimum over the examples: $\rho(\alpha) := \min_n \rho_n(\alpha)$.
[Freund, Schapire, Bartlett & Lee, 1998]

Large Margin and Linear Separation
The map $\Phi(x) = (h_1(x), h_2(x), \ldots)^\top$ with $H = \{h_1, h_2, \ldots\}$ takes the input space $X$ into the feature space $F$; linear separation in $F$ is nonlinear separation in $X$.
[Mangasarian, 1999; G.R., Mika, Schölkopf & Müller, 2002]

Margin vs. Edge
Margin: a measure of confidence in the prediction under a hypothesis weighting. Margin of example $n$ for the current hypothesis weighting $\alpha$: $\rho_n(\alpha) = y_n f_\alpha(x_n) = y_n \sum_{t=1}^T \alpha_t h_t(x_n)$, $\alpha \in P^T$.
Edge: a measure of the quality of a hypothesis w.r.t. a distribution. Edge of a hypothesis $h$ for a distribution $d$ on the examples: $\gamma_h(d) = \sum_{n=1}^N d_n y_n h(x_n)$, $d \in P^N$.
What is the connection? [Breiman, 1999]

von Neumann's Minimax Theorem
For a set of examples $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ and hypothesis set $H^t = \{h_1, \ldots, h_t\}$:
minimum edge (edge of $H^t$): $\gamma^t = \min_{d \in P^N} \max_{q=1,\ldots,t} \sum_{n=1}^N d_n y_n h_q(x_n)$
maximum margin (margin of $S$): $\rho^t = \max_{\alpha \in P^t} \min_n\, y_n \sum_{q=1}^t \alpha_q h_q(x_n)$
Duality: $\gamma^t = \rho^t$. [von Neumann, 1928]
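The two sides of this duality can be written as a primal–dual pair of linear programs (the standard LPBoost formulation; spelling them out here is an added illustration, not taken verbatim from the slides):

$$
\rho^t = \max_{\alpha \in P^t,\ \rho}\ \rho \quad \text{s.t.}\quad y_n \sum_{q=1}^t \alpha_q h_q(x_n) \ \geq\ \rho \quad (n = 1,\ldots,N),
$$
$$
\gamma^t = \min_{d \in P^N,\ \gamma}\ \gamma \quad \text{s.t.}\quad \sum_{n=1}^N d_n y_n h_q(x_n) \ \leq\ \gamma \quad (q = 1,\ldots,t),
$$

and LP duality gives $\gamma^t = \rho^t$.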

Part 3: Games

Two-player Zero-Sum Game
Rock, Paper, Scissors. Gain matrix M (row player plays R/P/S with mixed strategy d, column player plays R/P/S with mixed strategy α):

        R    P    S
  R     0    1   -1
  P    -1    0    1
  S     1   -1    0

A single row is a pure strategy of the row player and d is its mixed strategy; a single column is a pure strategy of the column player and α is its mixed strategy. The row player minimizes and the column player maximizes the payoff $d^\top M \alpha = \sum_{i,j} d_i M_{i,j} \alpha_j$.
[Freund and Schapire, 1997]

Optimum Strategy
Min-max theorem: with the optimal mixed strategies $d = \alpha = (1/3, 1/3, 1/3)$,
$\min_d \max_\alpha d^\top M \alpha = \min_d \max_j d^\top M e_j = \max_\alpha \min_d d^\top M \alpha = \max_\alpha \min_i e_i^\top M \alpha$ = value of the game (0 in the example), where $e_j$ denotes a pure strategy.

Connection to Boosting?
Rows are the examples and columns are the weak hypotheses, with $M_{i,j} = h_j(x_i)\, y_i$.
α-weighted row sum: margin of an example. d-weighted column sum: edge of a weak hypothesis.
Value of the game: $\gamma^* = \rho^*$.
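A small illustration of this correspondence, assuming a tiny set of ±1 labels and ±1-valued hypothesis outputs (all values below are made up for illustration):

```python
import numpy as np

# Game matrix M with M[i, j] = h_j(x_i) * y_i  (rows = examples, columns = hypotheses).
# H[i, j] is the prediction of hypothesis j on example i.
y = np.array([+1, -1, +1, -1])
H = np.array([[+1, +1, -1],
              [-1, +1, +1],
              [+1, -1, +1],
              [+1, -1, -1]])
M = H * y[:, None]

alpha = np.array([0.5, 0.25, 0.25])   # mixed strategy of the column player (hypothesis weights)
d = np.full(4, 0.25)                  # mixed strategy of the row player (example weights)

margins = M @ alpha                   # alpha-weighted row sums: margin of each example
edges = d @ M                         # d-weighted column sums: edge of each hypothesis
print(margins, margins.min())         # rho(alpha) is the minimum margin
print(edges, edges.max())             # the best hypothesis has the maximum edge
```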

New Column Added: Boosting

              R     P     S   (new)   margin
      α:     .44    0    .22   .33
  R  d1=.22    0    1    -1     1      .11
  P  d2=.33   -1    0     1     1      .11
  S  d3=.44    1   -1     0    -1      .11
  edge:      .11  -.22   .11   .11

Value of the game increases from 0 to .11.

Row Added: On-line Learning

              R     P     S    margin
      α:     .33   .44   .22
  R  d1=0      0    1    -1     .22
  P  d2=.22   -1    0     1    -.11
  S  d3=.44    1   -1     0    -.11
     d4=.33   -1    1    -1    -.11
  edge:     -.11  -.11  -.11

Value of the game decreases from 0 to -.11.

Boosting: Maximize the Margin Incrementally
The game matrix grows by one column per iteration (rows $d_1, d_2, d_3$):
iteration 1: column (0, 1, -1); iteration 2: columns (0, 1, -1), (-1, 0, 1); iteration 3: columns (0, 1, -1), (-1, 0, 1), (1, -1, 0).
In each iteration solve an optimization problem to update d; the column player adds a new column (weak hypothesis). Some assumption will be needed on the edge of the added hypothesis.

Value Non-decreasing
[Figure: the value of the game is non-decreasing as hypotheses are added.]
$\gamma^*, \rho^*$: edge/margin with respect to all hypotheses.

Duality Gap
For any non-optimal $d \in P^N$ and $\alpha \in P^t$: $\gamma(d) \geq \gamma^t = \rho^t \geq \rho(\alpha)$.

How Large is the Maximal Margin?
Assumption on the weak learner: for any distribution d on the examples, the weak learner returns a hypothesis h with edge $\gamma_h(d)$ at least $g$. Best case: $g = \rho^* = \gamma^*$.
Implication of the minimax theorem: there exists a hypothesis weighting α such that $\rho(\alpha) \geq g$.
Idea, iteratively solve the LP (LPBoost): add the best hypothesis $h = \operatorname{argmax}_h \gamma_h(d^t)$ to $H^t$ and re-solve $d^{t+1} = \operatorname{argmin}_{d \in P^N} \max_{h \in H^t} \gamma_h(d)$.
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]
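A sketch of the LPBoost distribution update, assuming the game matrix of the hypotheses received so far is available as U with $U_{n,q} = y_n h_q(x_n)$ and that scipy's LP solver is acceptable for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_distribution(U):
    """Solve  min_{d in simplex} max_q  sum_n d_n U[n, q]
    as a linear program in the variables (d_1, ..., d_N, gamma):
        minimize gamma  s.t.  U[:, q]^T d <= gamma for all q,  sum_n d_n = 1,  d >= 0."""
    N, t = U.shape
    c = np.concatenate([np.zeros(N), [1.0]])             # objective: minimize gamma
    A_ub = np.hstack([U.T, -np.ones((t, 1))])            # U[:, q]^T d - gamma <= 0
    b_ub = np.zeros(t)
    A_eq = np.concatenate([np.ones(N), [0.0]])[None, :]  # sum_n d_n = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * N + [(None, None)]             # d >= 0, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    d, gamma = res.x[:N], res.x[N]
    return d, gamma
```

Column generation then alternates between this LP and a call to the weak learner that supplies the column with maximal edge under the returned d.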

Convergence?
LPBoost: no iteration bounds known.
AdaBoost: may oscillate; does not find the maximizing α (counter-examples exist). But there are some guarantees: $\rho(\alpha^t) \geq 0$ after $2 \ln N / g^2$ iterations, and $\rho(\alpha^t) \geq g/2$ in the limit.
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]

How to Maximize the Margin?
Modify AdaBoost to maximize the margin:
Arc-GV asymptotically maximizes the margin; quite slow, no convergence rates.
LPBoost uses a linear-programming solver; often very fast in practice, but no convergence rates.
AdaBoost_ν requires $\frac{2 \ln N}{\nu^2}$ iterations to get $\rho^t \in [\rho^* - \nu, \rho^*]$; slow in practice, i.e. not faster than the theory predicts.
TotalBoost_ν requires $\frac{2 \ln N}{\nu^2}$ iterations to get $\rho^t \in [\rho^* - \nu, \rho^*]$; fast in practice — a combination of the benefits.
[G.R. & Warmuth, 2004; Warmuth et al., 2006]

Part 4: Convergence

Want margin $\geq g - \nu$
Assumption: $\gamma_t \geq g$. Estimate of the target: $\hat\gamma_t = (\min_{q=1}^{t} \gamma_q) - \nu$.

Idea: Project to $\hat\gamma_t$ Instead of 0
Corrective (AdaBoost_ν): single constraint
$\min_{d \in P^N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) \leq \hat\gamma_t$
Totally corrective (TotalBoost_ν): one constraint per past weak hypothesis
$\min_{d \in P^N} \Delta(d, d^1)$ s.t. $\sum_{n=1}^N d_n y_n h_q(x_n) \leq \hat\gamma_t$ for $q = 1, \ldots, t$

TotalBoost_ν
1 Input: $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, desired accuracy $\nu$
2 Initialize: $d_n^1 = 1/N$ for all $n = 1, \ldots, N$
3 Do for $t = 1, \ldots$:
  1 Train the classifier on $\{S, d^t\}$, obtain hypothesis $h_t: x \mapsto [-1, 1]$, and let $u_n^t = y_n h_t(x_n)$
  2 Calculate the edge of $h_t$: $\gamma_t = d^t \cdot u^t$
  3 Set $\hat\gamma_t = (\min_{q=1}^{t} \gamma_q) - \nu$ and solve the optimization problem
    $d^{t+1} = \operatorname{argmin}_{d \in C^t} \Delta(d, d^1)$, where $C^t := \{d \in P^N \mid d \cdot u^q \leq \hat\gamma_t \text{ for } 1 \leq q \leq t\}$
  4 If the above is infeasible or $d^{t+1}$ contains a zero, then set $T = t$ and break
4 Output: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$, where the coefficients $\alpha_t$ maximize the margin over the hypothesis set $\{h_1, \ldots, h_T\}$.
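A sketch of the totally corrective projection step (step 3.3), assuming the constraint matrix U with rows $u^q$ is available and using scipy's general-purpose solver for the entropy projection; a production implementation would use a dedicated convex solver, and all names here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def totalboost_projection(U, gamma_hat, d1=None):
    """Totally corrective step:  argmin_{d in C}  sum_n d_n ln(d_n / d1_n)
    with C = {d in simplex : U[q] . d <= gamma_hat for all past hypotheses q}.
    U has shape (t, N) with U[q, n] = y_n h_q(x_n)."""
    t, N = U.shape
    if d1 is None:
        d1 = np.full(N, 1.0 / N)                      # uniform initial distribution

    def rel_entropy(d):
        d = np.clip(d, 1e-12, None)                   # keep the log well defined
        return np.sum(d * np.log(d / d1))

    constraints = [{"type": "eq", "fun": lambda d: d.sum() - 1.0}]
    for q in range(t):                                # edge constraints: U[q].d <= gamma_hat
        constraints.append({"type": "ineq",
                            "fun": lambda d, q=q: gamma_hat - U[q] @ d})
    res = minimize(rel_entropy, d1, bounds=[(0.0, 1.0)] * N,
                   constraints=constraints, method="SLSQP")
    if not res.success:                               # infeasible: TotalBoost terminates
        return None
    return res.x
```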

Effect of Entropy Regularization
High weight is given to hard examples, a softmin of the margins:
$d_n^{t+1} \propto \exp(-\,y_n f_{\alpha^{t+1}}(x_n))$, where $y_n f_{\alpha^{t+1}}(x_n)$ is the margin of example $n$ and $\alpha^{t+1}$ are the current coefficients of the hypotheses.

Lower Bounding the Progress
Theorem. For $d^t, d^{t+1} \in P^N$ and $u^t \in [-1, 1]^N$, if $\Delta(d^{t+1}, d^t)$ is finite and $d^{t+1} \cdot u^t \leq d^t \cdot u^t$, then
$\Delta(d^{t+1}, d^t) > \frac{(d^{t+1} \cdot u^t - d^t \cdot u^t)^2}{2}$.
Here $d^t \cdot u^t = \gamma_t$ and $d^{t+1} \cdot u^t \leq \hat\gamma_t \leq \gamma_t - \nu$. Thus $d^t \cdot u^t - d^{t+1} \cdot u^t \geq \nu$ and $\Delta(d^{t+1}, d^t) > \frac{\nu^2}{2}$.
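One way to see the quadratic lower bound (an added reasoning step, not taken verbatim from the slides): by Hölder's inequality with $u^t \in [-1,1]^N$ and Pinsker's inequality,

$$
|d^{t+1} \cdot u^t - d^t \cdot u^t| \;\leq\; \|d^{t+1} - d^t\|_1 \,\|u^t\|_\infty \;\leq\; \|d^{t+1} - d^t\|_1,
\qquad
\Delta(d^{t+1}, d^t) \;\geq\; \tfrac{1}{2}\,\|d^{t+1} - d^t\|_1^2,
$$

so $\Delta(d^{t+1}, d^t) \geq \tfrac{1}{2}\,(d^{t+1}\cdot u^t - d^t \cdot u^t)^2 \geq \tfrac{\nu^2}{2}$.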

Generalized Pythagorean Theorem
$C^t = \{d \in P^N \mid d \cdot u^q \leq \hat\gamma_t,\ 1 \leq q \leq t\}$, $C^0 = P^N$, $C^t \subseteq C^{t-1}$.
$d^t$ is the projection of $d^1$ onto $C^{t-1}$ at iteration $t-1$: $d^t = \operatorname{argmin}_{d \in C^{t-1}} \Delta(d, d^1)$.
Then $\Delta(d^{t+1}, d^1) \geq \Delta(d^t, d^1) + \Delta(d^{t+1}, d^t)$.
[Herbster & Warmuth, 2001]

Sketch of Proof
Step 1: $\Delta(d^2, d^1) - \Delta(d^1, d^1) \geq \Delta(d^2, d^1) > \frac{\nu^2}{2}$
Step 2: $\Delta(d^3, d^1) - \Delta(d^2, d^1) \geq \Delta(d^3, d^2) > \frac{\nu^2}{2}$
Step t: $\Delta(d^{t+1}, d^1) - \Delta(d^t, d^1) \geq \Delta(d^{t+1}, d^t) > \frac{\nu^2}{2}$
Step T-2: $\Delta(d^{T-1}, d^1) - \Delta(d^{T-2}, d^1) \geq \Delta(d^{T-1}, d^{T-2}) > \frac{\nu^2}{2}$
Step T-1: $\Delta(d^T, d^1) - \Delta(d^{T-1}, d^1) \geq \Delta(d^T, d^{T-1}) > \frac{\nu^2}{2}$
[Warmuth, Liao & G.R., 2006]

Cancellation
In the chain above, $\Delta(d^1, d^1) = 0$ in the first step and $\Delta(d^T, d^1) \leq \ln N$ in the last step, and the left-hand sides telescope.
Therefore, $T \leq \frac{2 \ln N}{\nu^2}$.
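Making the telescoping explicit (a filled-in step, using only the stated bounds $\Delta(d^1, d^1) = 0$ and $\Delta(d^T, d^1) \leq \ln N$): summing the $T-1$ inequalities gives

$$
\ln N \;\geq\; \Delta(d^T, d^1) - \Delta(d^1, d^1) \;=\; \sum_{t=1}^{T-1}\bigl(\Delta(d^{t+1}, d^1) - \Delta(d^t, d^1)\bigr) \;>\; (T-1)\,\frac{\nu^2}{2},
$$

so $T - 1 < \frac{2\ln N}{\nu^2}$, i.e. the number of iterations is $O(\ln N / \nu^2)$ as stated.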

Overview
[Figure/table: overview of known iteration bounds.] [Long & Wu, 2002]

Iteration Bounds for Other Variants
Using the same techniques, we can prove iteration bounds for:
TotalBoost_ν which optimizes the divergence $\Delta(d, d^t)$ to the last distribution $d^t$;
TotalBoost_ν which uses the binary relative entropy $\Delta_2(d, d^1)$ or $\Delta_2(d, d^t)$ as the divergence;
the variant of AdaBoost_ν which terminates when $\gamma_t < \hat\gamma_t$.

TotalBoost_ν which optimizes $\Delta(d, d^t)$
$d^{t+1}$ is the projection of $d^t$ onto $C^t$ at iteration $t$: $d^{t+1} = \operatorname{argmin}_{d \in C^t} \Delta(d, d^t)$.
Generalized Pythagorean theorem: $\Delta(d^T, d^t) \geq \Delta(d^T, d^{t+1}) + \Delta(d^{t+1}, d^t)$.

Sketch of Proof
Step 1: $\Delta(d^T, d^1) - \Delta(d^T, d^2) \geq \Delta(d^2, d^1) > \frac{\nu^2}{2}$
Step 2: $\Delta(d^T, d^2) - \Delta(d^T, d^3) \geq \Delta(d^3, d^2) > \frac{\nu^2}{2}$
Step t: $\Delta(d^T, d^t) - \Delta(d^T, d^{t+1}) \geq \Delta(d^{t+1}, d^t) > \frac{\nu^2}{2}$
Step T-2: $\Delta(d^T, d^{T-2}) - \Delta(d^T, d^{T-1}) \geq \Delta(d^{T-1}, d^{T-2}) > \frac{\nu^2}{2}$
Step T-1: $\Delta(d^T, d^{T-1}) - \Delta(d^T, d^T) \geq \Delta(d^T, d^{T-1}) > \frac{\nu^2}{2}$

Cancellation
In this chain, $\Delta(d^T, d^1) \leq \ln N$ in the first step and $\Delta(d^T, d^T) = 0$ in the last step, and the left-hand sides again telescope.
Therefore, $T \leq \frac{2 \ln N}{\nu^2}$.

TotalBoost_ν which optimizes $\Delta(d, d^t)$
[Figure]

Part 5: Illustrative Experiments [Warmuth, Liao & G.R., 2006]

Illustrative Experiments
Cox-1 dataset from Telik Inc.: a relatively small drug-design data set with 125 binary-labeled examples and 3888 binary features. We compare the convergence of the margin versus the number of iterations.
[Warmuth, Liao & G.R., 2006]

Illustrative Experiments: Cox-1 (ν = 0.01)
[Figure: margin vs. number of iterations for AdaBoost_ν, LPBoost, and TotalBoost_ν.]
Results: the corrective algorithms are very slow; LPBoost and TotalBoost_ν need few iterations; the initial speed crucially depends on ν.

LPBoost May Perform Much Worse Than TotalBoost
We identified cases where LPBoost converges considerably more slowly than TotalBoost_ν. The data is a series of artificial datasets of 1000 examples with a varying number of features, created as follows: first generate $N_1$ random ±1-valued features $x_1, \ldots, x_{N_1}$ and set the label of each example to $y = \operatorname{sign}(x_1 + x_2 + x_3 + x_4 + x_5)$; then duplicate each feature $N_2$ times, perturb the features by Gaussian noise with $\sigma = 0.1$, and clip the feature values so that they lie in the interval [-1, 1]. Different $N_1, N_2$ are considered; the total number of features is $N_1 N_2$.
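A sketch of this data-generating procedure, assuming numpy; the random seed and the handling of sign ties are arbitrary illustrative choices.

```python
import numpy as np

def make_redundant_dataset(n_examples=1000, N1=1000, N2=100, sigma=0.1, seed=0):
    """Artificial data with many redundant features:
    N1 random +/-1 features, label = sign(x1 + ... + x5),
    each feature duplicated N2 times, perturbed by Gaussian noise, clipped to [-1, 1]."""
    rng = np.random.default_rng(seed)
    base = rng.choice([-1.0, 1.0], size=(n_examples, N1))
    y = np.sign(base[:, :5].sum(axis=1))
    y[y == 0] = 1.0                               # defensive only: a sum of five +/-1 values is never 0
    X = np.repeat(base, N2, axis=1)               # duplicate every feature N2 times
    X = X + rng.normal(0.0, sigma, size=X.shape)  # perturb by Gaussian noise
    X = np.clip(X, -1.0, 1.0)                     # clip to [-1, 1]
    return X, y
```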

LPBoost performs worse for high-dimensional data with many redundant features
[Figure: LPBoost vs. TotalBoost_ν on two 100,000-dimensional datasets: (left) many redundant features (N1 = 1,000, N2 = 100) and (right) independent features (N1 = 100,000, N2 = 1); margin vs. number of iterations.]

Bound Not True for LPBoost
A counter-example (Table 4.1): a toy dataset with 9 examples (rows) and 6 hypotheses (columns); '+' means the hypothesis predicts the example correctly, '-' means it is incorrect.

  Example \ Hypothesis   1  2  3  4  5  6
  1                      +  -  -  -  -  -
  2                      +  -  -  -  -  -
  3                      +  -  -  -  -  -
  4                      +  -  -  -  -  -
  5                      +  -  -  -  -  -
  6                      -  +  -  -  -  +
  7                      -  -  +  -  -  +
  8                      -  -  -  +  -  +
  9                      -  -  -  -  +  +

TotalBoost averages hypotheses 1 and 6 (i.e. 2 iterations) to achieve the maximum margin of 0.
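A quick check of this claim, assuming '+' corresponds to $y_i h_j(x_i) = +1$ and '-' to $-1$, so the game matrix M below encodes the table directly:

```python
import numpy as np

# M[i, j] = y_i * h_j(x_i) for the 9-example / 6-hypothesis toy dataset above.
M = np.array([[+1, -1, -1, -1, -1, -1]] * 5 +
             [[-1, +1, -1, -1, -1, +1],
              [-1, -1, +1, -1, -1, +1],
              [-1, -1, -1, +1, -1, +1],
              [-1, -1, -1, -1, +1, +1]])

alpha = np.array([0.5, 0, 0, 0, 0, 0.5])   # average hypotheses 1 and 6
margins = M @ alpha
print(margins.min())                        # 0.0: every example has margin exactly 0
```

This weighting attains margin 0, which the slide identifies as the maximum margin for this dataset.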

Bound Not True for LPBoost (continued)
Table 4.2: the first five iterations of simplex-based LPBoost on the toy dataset, showing the distribution over the examples and the selected hypothesis at each iteration.

  Example (h1..h6)        d at iter. 1   2   3   4   5
  1  + - - - - -               1/9      0   0   0   0
  2  + - - - - -               1/9      0   0   0   0
  3  + - - - - -               1/9      0   0   0   0
  4  + - - - - -               1/9      0   0   0   0
  5  + - - - - -               1/9      0   0   0   0
  6  - + - - - +               1/9      1   0   0   0
  7  - - + - - +               1/9      0   1   0   0
  8  - - - + - +               1/9      0   0   1   0
  9  - - - - + +               1/9      0   0   0   1
  Selected hypothesis:           1      2   3   4   5

Simplex-based LPBoost uses 5 iterations ((N+1)/2 iterations) to achieve margin 0.

Regularized LPBoost
LPBoost makes the edge constraints as tight as possible; picking solutions at the corners can lead to slow convergence, whereas interior-point methods avoid corners.
Regularized LPBoost: pick the solution that minimizes the relative entropy to the initial distribution. It is identical to TotalBoost_ν, except that the latter algorithm uses a higher edge bound.
Open problem: find an example where all versions of LPBoost need O(N) iterations.

Algorithms for Feature Selection
For each algorithm we test whether the selected hypotheses are redundant and the size of the final hypothesis.

Redundancy of Selected Hypotheses
We test which algorithm selects redundant hypotheses. Dataset 2 was created by expanding Rudin's dataset: it has 100 blocks of features, where each block contains one original feature and 99 mutations of it. Each mutation feature inverts the feature value of one randomly chosen example (with replacement). The mutation features are considered redundant hypotheses; a good algorithm would avoid repeatedly selecting hypotheses from the same block.

Experiment on Redundancy of Selected Hypotheses
[Figure: number of selected blocks vs. number of selected hypotheses.]
Redundancy of the selected hypotheses: AdaBoost > TotalBoost_ν (ν = 0.01) > LPBoost.
If $\rho^*$ is known, TotalBoost_ν with the known target g selects one hypothesis per block (not shown).

Size of Final Hypothesis
[Figure: margin vs. number of hypotheses with nonzero weight in the final hypothesis.]
TotalBoost_ν (ν = 0.01) and LPBoost use a small number of hypotheses in the final combined hypothesis.

Summary
AdaBoost can be viewed as an entropy projection. TotalBoost projects based on all previous hypotheses and provably maximizes the margin. Theory: as fast as AdaBoost_ν. Practice: much faster (comparable to LPBoost). Versatile techniques for proving iteration bounds; the experiments corroborate our theory. Good for feature selection. LPBoost may have problems maximizing the margin. Future: extension to the soft-margin case.

Iteration Bound for Variant of AdaBoost_ν
[Figure]