Totally Corrective Boosting Algorithms that Maximize the Margin
Manfred K. Warmuth (1), Jun Liao (1), Gunnar Rätsch (2)
(1) University of California, Santa Cruz
(2) Friedrich Miescher Laboratory, Tübingen, Germany
Last update: March 2, 2007
Outline
1. Basics
2. Large margins
3. Games
4. Convergence
5. Illustrative Experiments
Part 1: Basics
Protocol of Boosting
Maintain a distribution d^t on the examples.
At iteration t = 1, ..., T:
  1. Receive a weak hypothesis h_t
  2. Update d^t to d^{t+1}, putting more weight on hard examples
Output a convex combination of the weak hypotheses: f_α(x) = Σ_{t=1}^T α_t h_t(x)
[Freund & Schapire, 1995]
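The protocol above can be sketched as a minimal AdaBoost-style loop. This is an illustrative sketch, not the paper's exact algorithm; the threshold-stump weak learner and the tiny dataset in the usage below are hypothetical choices of ours.

```python
import math

def stump_learner(X, y, d):
    """Hypothetical weak learner: the best 1-D threshold stump under d."""
    best, best_err = None, 2.0
    for th in sorted(set(X)):
        for s in (+1, -1):
            h = lambda x, th=th, s=s: s if x >= th else -s
            err = sum(dn for dn, xn, yn in zip(d, X, y) if h(xn) != yn)
            if err < best_err:
                best, best_err = h, err
    return best

def boost(X, y, weak_learner, T):
    """Generic boosting protocol: maintain a distribution d on the examples,
    put more weight on hard examples, output a convex combination."""
    N = len(y)
    d = [1.0 / N] * N                              # start uniform
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, d)                  # receive a weak hypothesis
        eps = sum(dn for dn, xn, yn in zip(d, X, y) if h(xn) != yn)
        if eps == 0.0:                             # perfect hypothesis: use it alone
            hypotheses, alphas = [h], [1.0]
            break
        if eps >= 0.5:                             # no edge left
            break
        alpha = 0.5 * math.log((1 - eps) / eps)    # AdaBoost's coefficient
        d = [dn * math.exp(-alpha * yn * h(xn))    # upweight misclassified examples
             for dn, xn, yn in zip(d, X, y)]
        Z = sum(d)
        d = [dn / Z for dn in d]
        hypotheses.append(h)
        alphas.append(alpha)
    s = sum(alphas)
    alphas = [a / s for a in alphas]               # normalize: convex combination
    return lambda x: sum(a * h(x) for a, h in zip(alphas, hypotheses))
```

Usage on a made-up separable toy set: `f = boost([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1], stump_learner, T=10)` returns a combined hypothesis whose sign classifies all four points correctly.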
Boosting: 1st Iteration
First hypothesis (see figure):
  Error rate: ε_t = Σ_{n=1}^N d_n^t I(h_t(x_n) ≠ y_n) = 2/11
  Edge: γ_t = Σ_{n=1}^N d_n^t y_n h_t(x_n) = 9/22
(For ±1-valued hypotheses, γ_t = 1 − 2ε_t.)
Update Distribution
Misclassified examples get increased weights.
After the update: error rate ε(h_t, d^{t+1}) = 1/2, edge γ(h_t, d^{t+1}) = 0.
AdaBoost as Entropy Projection
Minimize the relative entropy to the last distribution subject to a constraint:
  min_{d ∈ P^N} Δ(d, d^t)   s.t.   Σ_{n=1}^N d_n y_n h_t(x_n) = 0
where Δ(d, d^t) = Σ_{n=1}^N d_n ln(d_n / d_n^t) and P^N is the N-dimensional probability simplex.
[Lafferty, 1999; Kivinen & Warmuth, 1999]
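This projection can be computed numerically: the minimizer has the exponential form d'_n ∝ d_n^t exp(−a u_n) with u_n = y_n h_t(x_n), and the Lagrange multiplier a is the value at which the new edge hits 0. A sketch using bisection (the fixed search range is an assumption of ours):

```python
import math

def project_zero_edge(d, u, tol=1e-12):
    """Relative-entropy projection of d onto the hyperplane {d' : d'.u = 0}.
    The minimizer has the exponential form d'_n ~ d_n * exp(-a*u_n); the
    multiplier a is found by bisection, since the new edge decreases in a."""
    def edge(a):
        w = [dn * math.exp(-a * un) for dn, un in zip(d, u)]
        z = sum(w)
        return sum(wn * un for wn, un in zip(w, u)) / z
    lo, hi = -50.0, 50.0            # assumed search range for the multiplier
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if edge(mid) > 0:           # edge still positive: increase a
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    w = [dn * math.exp(-a * un) for dn, un in zip(d, u)]
    z = sum(w)
    return [wn / z for wn in w]
```

For ±1-valued hypotheses this reproduces AdaBoost's closed-form update with a = ½ ln((1 − ε)/ε), which is why AdaBoost can be viewed as an entropy projection.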
Before 2nd Iteration (figure)
Boosting: 2nd Hypothesis
AdaBoost assumption: edge γ > ν
Update Distribution
Edge γ = 0: the AdaBoost update sets the edge of the last hypothesis to 0.
Number of iterations: 2 ln N / ν²
Which Constraints?
Corrective (AdaBoost): a single constraint for the last hypothesis
  min_{d ∈ P^N} Δ(d, d^t)   s.t.   Σ_{n=1}^N d_n y_n h_t(x_n) = 0
Totally corrective: one constraint per past weak hypothesis
  min_{d ∈ P^N} Δ(d, d^1)   s.t.   Σ_{n=1}^N d_n y_n h_q(x_n) = 0   for q = 1, ..., t
[Lafferty, 1999; Kivinen & Warmuth, 1999]
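The totally corrective projection has no one-step closed form, but for consistent equality constraints it can be approximated by cycling entropy projections onto the individual hyperplanes (Bregman's method). A self-contained sketch, with the per-hyperplane projection done by bisection as before (the sweep and iteration counts are our assumptions):

```python
import math

def entropy_project(d, u):
    """Entropy projection of d onto {d' : d'.u = 0} via bisection on the
    exponential-family multiplier (the edge decreases monotonically in a)."""
    def reweight(a):
        w = [dn * math.exp(-a * un) for dn, un in zip(d, u)]
        z = sum(w)
        return [wn / z for wn in w]
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        w = reweight(mid)
        if sum(wn * un for wn, un in zip(w, u)) > 0:
            lo = mid
        else:
            hi = mid
    return reweight(0.5 * (lo + hi))

def totally_corrective(d1, U, sweeps=100):
    """Approximate the totally corrective update
        argmin Δ(d, d1)  s.t.  d.u = 0 for every row u of U
    by cyclic entropy projections onto the individual constraint hyperplanes
    (Bregman's cyclic projection method for equality constraints)."""
    d = list(d1)
    for _ in range(sweeps):
        for u in U:
            d = entropy_project(d, u)
    return d
```

After enough sweeps, the returned distribution satisfies all past edge constraints simultaneously, whereas the corrective update only zeroes the edge of the most recent hypothesis.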
Boosting: 3rd Hypothesis (figure)
Boosting: 4th Hypothesis (figure)
All Hypotheses (figure)
Decision: f_α(x) = Σ_{t=1}^T α_t h_t(x) > 0?
Part 2: Large margins
Large Margins
Large margins in addition to correct classification.
Margin of the combined hypothesis f_α for example (x_n, y_n):
  ρ_n(α) = y_n f_α(x_n) = y_n Σ_{t=1}^T α_t h_t(x_n)   (α ∈ P^T)
The margin of a set of examples is the minimum over the examples:
  ρ(α) := min_n ρ_n(α)
[Freund, Schapire, Bartlett & Lee, 1998]
Large Margin and Linear Separation
Input space X, feature space F, feature map Φ(x) = (h_1(x), h_2(x), ...)^T with H = {h_1, h_2, ...}.
Linear separation in F is nonlinear separation in X.
[Mangasarian, 1999; G.R., Mika, Schölkopf & Müller, 2002]
Margin vs. Edge
Margin: a measure of confidence in the prediction for a hypothesis weighting.
  Margin of example n for the current weighting α: ρ_n(α) = y_n f_α(x_n) = y_n Σ_{t=1}^T α_t h_t(x_n), α ∈ P^T
Edge: a measure of the goodness of a hypothesis w.r.t. a distribution.
  Edge of hypothesis h for distribution d on the examples: γ_h(d) = Σ_{n=1}^N d_n y_n h(x_n), d ∈ P^N
What is the connection?
[Breiman, 1999]
von Neumann's Minimax Theorem
For a set of examples S = {(x_1, y_1), ..., (x_N, y_N)} and hypothesis set H^t = {h_1, ..., h_t}:
  minimum edge:   γ^t = min_{d ∈ P^N} max_{h ∈ H^t} γ_h(d)     (edge of H^t)
  maximum margin: ρ^t = max_{α ∈ P^t} min_n y_n f_α(x_n)       (margin of S)
Duality: γ^t = ρ^t
[von Neumann, 1928]
The same statement with explicit sums:
  γ^t = min_{d ∈ P^N} max_{q=1,...,t} Σ_{n=1}^N d_n y_n h_q(x_n)    (edge of H^t)
  ρ^t = max_{α ∈ P^t} min_n y_n Σ_{q=1}^t α_q h_q(x_n)              (margin of S)
Duality: γ^t = ρ^t
Part 3: Games
Two-Player Zero-Sum Game
Rock, Paper, Scissors gain matrix (row player vs. column player):
         R    P    S
    R    0    1   -1
    P   -1    0    1
    S    1   -1    0
A single row is a pure strategy of the row player; d is a mixed strategy over rows.
A single column is a pure strategy of the column player; α is a mixed strategy over columns.
The row player minimizes and the column player maximizes the payoff d^T M α = Σ_{i,j} d_i M_{i,j} α_j.
[Freund & Schapire, 1997]
Optimum Strategy
With the uniform strategies d = α = (.33, .33, .33), the min-max theorem gives
  min_d max_α d^T M α = min_d max_j d^T M e_j = max_α min_d d^T M α = max_α min_i e_i^T M α
  = value of the game (0 in the example),
where e_j denotes a pure strategy.
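The claim that the uniform strategies are optimal with value 0 is easy to verify directly: against the uniform row strategy every pure column strategy earns payoff 0, and symmetrically for the columns, so neither player can improve. A quick check:

```python
# Gain matrix of Rock, Paper, Scissors from the slide.
M = [[ 0,  1, -1],
     [-1,  0,  1],
     [ 1, -1,  0]]

def payoff(d, M, a):
    """Expected payoff d^T M a for mixed strategies d (rows) and a (columns)."""
    return sum(di * Mij * aj
               for di, row in zip(d, M)
               for Mij, aj in zip(row, a))

uniform = [1.0 / 3] * 3
# Payoff of each pure column strategy against the uniform row strategy,
# and of each pure row strategy against the uniform column strategy:
col_payoffs = [sum(uniform[i] * M[i][j] for i in range(3)) for j in range(3)]
row_payoffs = [sum(M[i][j] * uniform[j] for j in range(3)) for i in range(3)]
```

All entries of `col_payoffs` and `row_payoffs` are 0, so the value of the game is 0, as stated on the slide.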
Connection to Boosting?
Rows are the examples, columns are the weak hypotheses: M_{i,j} = y_i h_j(x_i).
Weighted row sum: margin of an example. Weighted column sum: edge of a weak hypothesis.
Value of the game: γ* = ρ*
Boosting: a New Column Is Added
             α: .44    0   .22   .33   margin
  R  d=.22:     0     1    -1     1     .11
  P  d=.33:    -1     0     1     1     .11
  S  d=.44:     1    -1     0    -1     .11
  edge:        .11  -.22   .11   .11
The value of the game increases from 0 to .11.
On-Line Learning: a New Row Is Added
                  α: .33   .44   .22   margin
  R      d=0:         0     1    -1     .22
  P      d=.22:      -1     0     1    -.11
  S      d=.44:       1    -1     0    -.11
  (new)  d=.33:      -1     1    -1    -.11
  edge:             -.11  -.11  -.11
The value of the game decreases from 0 to -.11.
Boosting: Maximize the Margin Incrementally
  iteration 1: M = [0; 1; -1]
  iteration 2: M = [0 -1; 1 0; -1 1]
  iteration 3: M = [0 -1 1; 1 0 -1; -1 1 0]
In each iteration, solve an optimization problem to update d; the column player then adds a new column (weak hypothesis).
Some assumption on the edge of the added hypothesis will be needed.
Value Non-Decreasing (figure)
γ*, ρ*: edge/margin for the set of all hypotheses
Duality Gap
For any non-optimal d ∈ P^N and α ∈ P^t:   γ(d) > γ^t = ρ^t > ρ(α)
How Large is the Maximal Margin?
Assumption on the weak learner: for any distribution d on the examples, the weak learner returns a hypothesis h with edge γ_h(d) at least g. Best case: g = ρ* = γ*.
Implication from the minimax theorem: there exists α ∈ P^t such that ρ(α) ≥ g.
Idea, realized by LPBoost: iteratively solve the LP. Add the best hypothesis h = argmax_h γ_h(d^t) to H^t and resolve
  d^{t+1} = argmin_{d ∈ P^N} max_{h ∈ H^t} γ_h(d)
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]
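The LPBoost distribution update above can be written as an explicit linear program over the stacked variables (d, γ). A sketch using SciPy's `linprog`; the variable layout and solver choice are ours, not the paper's:

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_distribution(U):
    """LPBoost's distribution update: given U with U[q][n] = y_n h_q(x_n)
    for the hypotheses chosen so far, solve
        min_{d in simplex}  max_q  d . U[q]
    as an LP over the stacked variables (d_1, ..., d_N, gamma)."""
    U = np.asarray(U, dtype=float)
    t, N = U.shape
    c = np.zeros(N + 1)
    c[-1] = 1.0                                     # minimize gamma
    A_ub = np.hstack([U, -np.ones((t, 1))])         # d.u_q - gamma <= 0
    b_ub = np.zeros(t)
    A_eq = np.ones((1, N + 1))
    A_eq[0, -1] = 0.0                               # sum_n d_n = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * N + [(None, None)]       # d >= 0, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:N], res.x[-1]
```

For example, with two hypotheses on three examples, `lpboost_distribution([[1, 1, -1], [-1, 1, 1]])` returns a distribution whose maximal edge γ equals 0, the minimum possible for that matrix.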
Convergence?
LPBoost: no iteration bounds known.
AdaBoost: may oscillate and does not find the margin-maximizing α (counterexamples exist). But there are some guarantees:
  ρ(α^t) ≥ 0 after 2 ln N / g² iterations
  ρ(α^t) ≥ g/2 in the limit
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]
How to Maximize the Margin?
Arc-GV (a modification of AdaBoost) asymptotically maximizes the margin, but is quite slow, with no convergence rates.
LPBoost uses a linear programming solver; often very fast in practice, but no convergence rates.
AdaBoost_ν requires 2 ln N / ν² iterations to get ρ^t ∈ [ρ* − ν, ρ*]; slow in practice, i.e. not faster than the theory predicts.
TotalBoost_ν requires 2 ln N / ν² iterations to get ρ^t ∈ [ρ* − ν, ρ*]; fast in practice. A combination of the benefits.
[G.R. & Warmuth, 2004; Warmuth et al., 2006]
Part 4: Convergence
Want margin ≥ g − ν
Assumption: γ_t ≥ g.
Estimate of the target: γ̂^t = (min_{q=1}^t γ_q) − ν
Idea: Project to γ̂^t Instead of 0
Corrective (AdaBoost_ν): a single constraint
  min_{d ∈ P^N} Δ(d, d^t)   s.t.   Σ_{n=1}^N d_n y_n h_t(x_n) ≤ γ̂^t
Totally corrective (TotalBoost_ν): one constraint per past weak hypothesis
  min_{d ∈ P^N} Δ(d, d^1)   s.t.   Σ_{n=1}^N d_n y_n h_q(x_n) ≤ γ̂^t   for q = 1, ..., t
TotalBoost_ν
1. Input: S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, desired accuracy ν
2. Initialize: d_n^1 = 1/N for all n = 1 ... N
3. Do for t = 1, ...:
   a. Train a classifier on {S, d^t}, obtaining hypothesis h_t: x → [−1, 1]; let u_n^t = y_n h_t(x_n)
   b. Calculate the edge of h_t: γ_t = d^t · u^t
   c. Set γ̂^t = (min_{q=1}^t γ_q) − ν and solve
        d^{t+1} = argmin_{d ∈ C^t} Δ(d, d^1),   where C^t = {d ∈ P^N : d · u^q ≤ γ̂^t for 1 ≤ q ≤ t}
   d. If the above is infeasible or d^{t+1} contains a zero, then set T = t and break
4. Output: f_α(x) = Σ_{t=1}^T α_t h_t(x), where the coefficients α_t maximize the margin over the hypothesis set {h_1, ..., h_T}.
Effect of Entropy Regularization
High weight is given to hard examples; the weights form a softmin of the margins:
  d_n^{t+1} ∝ exp(−y_n f_{α^{t+1}}(x_n)),   where y_n f_{α^{t+1}}(x_n) is the margin of example n
and α^{t+1} are the current coefficients of the hypotheses.
Lower Bounding the Progress
Theorem: For d^t, d^{t+1} ∈ P^N and u^t ∈ [−1, 1]^N, if Δ(d^{t+1}, d^t) is finite and d^{t+1} · u^t ≤ d^t · u^t, then
  Δ(d^{t+1}, d^t) > (d^{t+1} · u^t − d^t · u^t)² / 2
Since d^t · u^t = γ_t and d^{t+1} · u^t ≤ γ̂^t ≤ γ_t − ν, we have d^t · u^t − d^{t+1} · u^t ≥ ν and thus Δ(d^{t+1}, d^t) > ν²/2.
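The inequality is a Pinsker-type bound: Δ(d', d) ≥ ½‖d' − d‖₁², and |d' · u − d · u| ≤ ‖d' − d‖₁ for u ∈ [−1, 1]^N. A randomized spot-check of the (non-strict) form of the bound, on distributions and payoff vectors we generate ourselves:

```python
import math
import random

def rel_entropy(p, q):
    """Delta(p, q) = sum_n p_n ln(p_n / q_n) for distributions p, q."""
    return sum(pn * math.log(pn / qn) for pn, qn in zip(p, q) if pn > 0)

def check_progress_bound(trials=1000, n=5, seed=0):
    """Randomized spot-check of the progress lemma:
    for d, d' in the simplex and u in [-1, 1]^n,
        Delta(d', d) >= (d'.u - d.u)^2 / 2."""
    rng = random.Random(seed)
    for _ in range(trials):
        d = [rng.random() + 1e-9 for _ in range(n)]
        dp = [rng.random() + 1e-9 for _ in range(n)]
        sd, sdp = sum(d), sum(dp)
        d = [x / sd for x in d]          # normalize to the simplex
        dp = [x / sdp for x in dp]
        u = [rng.uniform(-1, 1) for _ in range(n)]
        gap = (sum(a * b for a, b in zip(dp, u))
               - sum(a * b for a, b in zip(d, u)))
        assert rel_entropy(dp, d) >= gap * gap / 2 - 1e-12
    return True
```

The check holds for gaps of either sign; the theorem only needs the case d^{t+1} · u^t ≤ d^t · u^t.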
Generalized Pythagorean Theorem
C^t = {d ∈ P^N : d · u^q ≤ γ̂^t, 1 ≤ q ≤ t},   C^0 = P^N,   C^t ⊆ C^{t−1}
d^t is the projection of d^1 onto C^{t−1} at iteration t − 1:   d^t = argmin_{d ∈ C^{t−1}} Δ(d, d^1)
  Δ(d^{t+1}, d^1) ≥ Δ(d^t, d^1) + Δ(d^{t+1}, d^t)
[Herbster & Warmuth, 2001]
Sketch of Proof
  1:    Δ(d^2, d^1) − Δ(d^1, d^1) ≥ Δ(d^2, d^1) > ν²/2
  2:    Δ(d^3, d^1) − Δ(d^2, d^1) ≥ Δ(d^3, d^2) > ν²/2
  ...
  t:    Δ(d^{t+1}, d^1) − Δ(d^t, d^1) ≥ Δ(d^{t+1}, d^t) > ν²/2
  ...
  T−2:  Δ(d^{T−1}, d^1) − Δ(d^{T−2}, d^1) ≥ Δ(d^{T−1}, d^{T−2}) > ν²/2
  T−1:  Δ(d^T, d^1) − Δ(d^{T−1}, d^1) ≥ Δ(d^T, d^{T−1}) > ν²/2
[Warmuth, Liao & G.R., 2006]
Cancellation
Summing the T − 1 inequalities, the middle terms telescope: Δ(d^1, d^1) = 0 and Δ(d^T, d^1) ≤ ln N, so
  (T − 1) ν²/2 < Δ(d^T, d^1) ≤ ln N
Therefore T ≤ 2 ln N / ν².
Overview of Iteration Bounds (figure)
[Long & Wu, 2002]
Iteration Bounds for Other Variants
Using the same techniques, we can prove iteration bounds for:
  the TotalBoost_ν variant that minimizes the divergence Δ(d, d^t) to the last distribution d^t
  the TotalBoost_ν variant that uses the binary relative entropy Δ₂(d, d^1) or Δ₂(d, d^t) as the divergence
  the variant of AdaBoost_ν that terminates when γ_t < γ̂^t
TotalBoost_ν Optimizing Δ(d, d^t)
d^{t+1} is the projection of d^t onto C^t at iteration t:   d^{t+1} = argmin_{d ∈ C^t} Δ(d, d^t)
  Δ(d^T, d^t) ≥ Δ(d^T, d^{t+1}) + Δ(d^{t+1}, d^t)
Sketch of Proof
  1:    Δ(d^T, d^1) − Δ(d^T, d^2) ≥ Δ(d^2, d^1) > ν²/2
  2:    Δ(d^T, d^2) − Δ(d^T, d^3) ≥ Δ(d^3, d^2) > ν²/2
  ...
  t:    Δ(d^T, d^t) − Δ(d^T, d^{t+1}) ≥ Δ(d^{t+1}, d^t) > ν²/2
  ...
  T−2:  Δ(d^T, d^{T−2}) − Δ(d^T, d^{T−1}) ≥ Δ(d^{T−1}, d^{T−2}) > ν²/2
  T−1:  Δ(d^T, d^{T−1}) − Δ(d^T, d^T) ≥ Δ(d^T, d^{T−1}) > ν²/2
Cancellation
Summing, the middle terms telescope: Δ(d^T, d^1) ≤ ln N and Δ(d^T, d^T) = 0, so
  (T − 1) ν²/2 < Δ(d^T, d^1) ≤ ln N
Therefore T ≤ 2 ln N / ν².
TotalBoost_ν Optimizing Δ(d, d^t) (figure)
Part 5: Illustrative Experiments
[Warmuth, Liao & G.R., 2006]
Illustrative Experiments
Cox-1 dataset from Telik Inc.: a relatively small drug-design data set with 125 binary-labeled examples and 3888 binary features.
We compare the convergence of the margin versus the number of iterations.
Cox-1 (ν = 0.01): AdaBoost_ν, LPBoost, TotalBoost_ν (figure)
Results: the corrective algorithms are very slow; LPBoost and TotalBoost_ν need few iterations; the initial speed crucially depends on ν.
LPBoost May Perform Much Worse Than TotalBoost
We identified cases where LPBoost converges considerably more slowly than TotalBoost_ν. The data is a series of artificial datasets of 1000 examples with a varying number of features, created as follows:
First generate N₁ random ±1-valued features x_1, ..., x_{N₁} and set the label of each example to y = sign(x_1 + x_2 + x_3 + x_4 + x_5).
Then duplicate each feature N₂ times, perturb the features by Gaussian noise with σ = 0.1, and clip the feature values so that they lie in the interval [−1, 1].
Over different choices of N₁ and N₂, the total number of features is N₁N₂.
LPBoost Performs Worse on High-Dimensional Data with Many Redundant Features
LPBoost vs. TotalBoost_ν on two 100,000-dimensional datasets: [left] many redundant features (N₁ = 1,000, N₂ = 100) and [right] independent features (N₁ = 100,000, N₂ = 1).
The figures show margin vs. number of iterations.
Bound Not True for LPBoost
A counterexample (Table 4.1, a toy dataset): each row corresponds to an example and each column to a hypothesis; '+' means the hypothesis predicts the example correctly, '−' that it does not.
  Example   h1  h2  h3  h4  h5  h6
  1-5        +   -   -   -   -   -
  6          -   +   -   -   -   +
  7          -   -   +   -   -   +
  8          -   -   -   +   -   +
  9          -   -   -   -   +   +
TotalBoost averages hypotheses 1 and 6 (i.e., 2 iterations) to achieve the maximum margin 0.
Bound Not True for LPBoost (continued)
The first five iterations of LPBoost (Table 4.2): the distribution of the examples and the selected hypothesis at each iteration.
  Iteration             1    2    3    4    5
  d on examples 1-5    1/9   0    0    0    0
  d on example 6       1/9   1    0    0    0
  d on example 7       1/9   0    1    0    0
  d on example 8       1/9   0    0    1    0
  d on example 9       1/9   0    0    0    1
  Selected hypothesis   1    2    3    4    5
Simplex-based LPBoost uses 5 iterations ((N + 1)/2 iterations) to achieve margin 0.
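The toy dataset is small enough to check the margin claims directly; the matrix below encodes '+' as +1 and '−' as −1, i.e. entry y_i h_j(x_i):

```python
# Rows = 9 examples, columns = 6 hypotheses; entry +1 iff the hypothesis
# predicts the example correctly, i.e. the entry equals y_i * h_j(x_i).
M = ([[+1, -1, -1, -1, -1, -1]] * 5 +
     [[-1, +1, -1, -1, -1, +1],
      [-1, -1, +1, -1, -1, +1],
      [-1, -1, -1, +1, -1, +1],
      [-1, -1, -1, -1, +1, +1]])

def margins(M, alpha):
    """Per-example margins of the convex combination alpha of the columns."""
    return [sum(a * row[j] for j, a in enumerate(alpha)) for row in M]

# Averaging hypotheses 1 and 6 gives every example margin 0, the maximum:
alpha_total = [0.5, 0, 0, 0, 0, 0.5]

# Every single hypothesis misclassifies some example, so each one-hot
# weighting has minimum margin -1:
single_margins = [min(margins(M, [1.0 if j == k else 0.0 for j in range(6)]))
                  for k in range(6)]
```

This confirms the slide: two hypotheses suffice for TotalBoost, while no single column achieves a nonnegative minimum margin.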
Regularized LPBoost
LPBoost makes the edge constraints as tight as possible, and picking solutions at the corners can lead to slow convergence; interior-point methods avoid corners.
Regularized LPBoost picks the solution that minimizes the relative entropy to the initial distribution. It is identical to TotalBoost_ν, except that the latter uses a higher edge bound.
Open problem: find an example where all versions of LPBoost need O(N) iterations.
Algorithms for Feature Selection
For each algorithm we test: whether the selected hypotheses are redundant, and the size of the final hypothesis.
Redundancy of Selected Hypotheses
We test which algorithms select redundant hypotheses. Dataset 2 was created by expanding Rudin's dataset: it has 100 blocks of features, where each block contains one original feature and 99 mutations of it. Each mutation feature inverts the feature value of one randomly chosen example (with replacement). The mutation features are considered redundant hypotheses; a good algorithm would avoid repeatedly selecting hypotheses from the same block.
Experiment on Redundancy of Selected Hypotheses
The figure shows the number of selected blocks vs. the number of selected hypotheses.
Redundancy of the selected hypotheses: AdaBoost > TotalBoost_ν (ν = 0.01) > LPBoost.
If ρ* is known, TotalBoost_ν^g selects one hypothesis per block (not shown).
Size of the Final Hypothesis
The figure shows margins vs. the number of hypotheses used (nonzero weight).
TotalBoost_ν (ν = 0.01) and LPBoost use a small number of hypotheses in the final hypothesis.
Summary
AdaBoost can be viewed as entropy projection.
TotalBoost projects based on all previous hypotheses and provably maximizes the margin.
Theory: as fast as AdaBoost_ν. Practice: much faster (comparable to LPBoost).
Versatile techniques for proving iteration bounds; the experiments corroborate our theory.
Good for feature selection. LPBoost may have problems maximizing the margin.
Future: extension to the soft-margin case.
Iteration Bound for a Variant of AdaBoost_ν (figure)