Totally Corrective Boosting Algorithms that Maximize the Margin


1 Totally Corrective Boosting Algorithms that Maximize the Margin
Manfred K. Warmuth¹, Jun Liao¹, Gunnar Rätsch²
¹ University of California, Santa Cruz; ² Friedrich Miescher Laboratory, Tübingen, Germany
Last update: March 2, 2007

2 Outline
1. Basics
2. Large margins
3. Games
4. Convergence
5. Illustrative Experiments

3 Outline (section: Basics)

4 Protocol of Boosting (Basics)
Maintain a distribution $d^t$ on the examples.
At iteration $t = 1, \dots, T$:
1. Receive a weak hypothesis $h_t$.
2. Update $d^t$ to $d^{t+1}$, putting more weight on hard examples.
Output a convex combination of the weak hypotheses: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$.
[Freund & Schapire, 1995]
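
A minimal sketch of this protocol in Python; `weak_learner` and `update` are hypothetical placeholders for a base learner that accepts a weighted sample and for a distribution-update rule:

```python
import numpy as np

def boost(X, y, weak_learner, update, T):
    """Generic boosting protocol: maintain a distribution d over the
    N examples, receive a weak hypothesis, reweight hard examples."""
    N = len(y)
    d = np.full(N, 1.0 / N)                # d^1: uniform
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, d)          # 1. receive weak hypothesis h_t
        d, a = update(d, y * h(X))         # 2. update d^t -> d^{t+1}
        hypotheses.append(h)
        alphas.append(a)
    # output: the convex combination f_alpha(x) = sum_t alpha_t h_t(x)
    return lambda Xq: sum(a * h(Xq) for a, h in zip(alphas, hypotheses))
```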


7 Boosting: 1st Iteration (Basics)
First hypothesis (values from the figure):
Error rate: $\epsilon_t = \sum_{n=1}^N d_n^t \, I(h_t(x_n) \neq y_n) = 2/11$
Edge: $\gamma_t = \sum_{n=1}^N d_n^t \, y_n h_t(x_n) = 9/22$
(In general, $\gamma_t = 1 - 2\epsilon_t$ for $\pm 1$-valued hypotheses.)
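
The two quantities in NumPy (a small sketch assuming labels and predictions in $\{-1, +1\}$):

```python
import numpy as np

def error_rate(d, y, pred):
    """Weighted error: sum of d_n over misclassified examples."""
    return float(np.sum(d * (pred != y)))

def edge(d, y, pred):
    """Edge gamma = sum_n d_n * y_n * h(x_n); equals 1 - 2*error
    when pred takes values in {-1, +1}."""
    return float(np.sum(d * y * pred))
```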


9 Update Distribution (Basics)
Misclassified examples get increased weights.
After the update: error rate $\epsilon(h_t, d^{t+1}) = 1/2$ and edge $\gamma(h_t, d^{t+1}) = 0$.


11 AdaBoost as Entropy Projection (Basics)
Minimize the relative entropy to the last distribution subject to a constraint:
$\min_{d \in P_N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) = 0$,
where $\Delta(d, d^t) = \sum_{n=1}^N d_n \ln \frac{d_n}{d_n^t}$ and $P_N$ is the $N$-dimensional probability simplex.
[Lafferty, 1999; Kivinen & Warmuth, 1999]
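
Solving this projection recovers AdaBoost's exponential update: by Lagrangian duality, $d_n^{t+1} \propto d_n^t e^{-\eta u_n}$ with $u_n = y_n h_t(x_n)$ and $\eta$ chosen so the constraint holds. A sketch that finds $\eta$ numerically (assuming $u$ has entries of both signs, so the constraint is feasible):

```python
import numpy as np
from scipy.optimize import brentq

def entropy_project(d, u):
    """Project d onto {d' in the simplex : d'.u = 0} in relative entropy.
    Duality gives d'_n proportional to d_n * exp(-eta * u_n), with eta
    chosen so the new edge is exactly zero (AdaBoost's update)."""
    def new_edge(eta):
        w = d * np.exp(-eta * u)
        w /= w.sum()
        return float(w @ u)
    # the edge is monotone decreasing in eta; a wide bracket suffices
    eta = brentq(new_edge, -50.0, 50.0)
    w = d * np.exp(-eta * u)
    return w / w.sum()
```

For $\pm 1$-valued hypotheses the root has a closed form, $\eta = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$, which is exactly AdaBoost's coefficient $\alpha_t$.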

12 Before 2nd Iteration (Basics; figure)

13 Boosting: 2nd Hypothesis (Basics)
AdaBoost assumption: edge $\gamma > \nu$.

14 Update Distribution (Basics)
Edge $\gamma = 0$: the AdaBoost update sets the edge of the last hypothesis to 0.
Number of iterations: $\frac{2 \ln N}{\nu^2}$.

15 Which constraints? (Basics)
Corrective (AdaBoost): a single constraint on the last hypothesis:
$\min_{d \in P_N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) = 0$.
Totally corrective: one constraint per past weak hypothesis:
$\min_{d \in P_N} \Delta(d, d^1)$ s.t. $\sum_{n=1}^N d_n y_n h_q(x_n) = 0$ for $q = 1, \dots, t$.
[Lafferty, 1999; Kivinen & Warmuth, 1999]


17 Boosting: 3rd Hypothesis (Basics; figure)

18 Boosting: 4th Hypothesis (Basics; figure)

19 All Hypotheses (Basics; figure)

20 Decision (Basics): $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x) > 0$?

21 Outline (section: Large margins)

22 Large margins in addition to correct classification
Margin of the combined hypothesis $f_\alpha$ for example $(x_n, y_n)$:
$\rho_n(\alpha) = y_n f_\alpha(x_n) = y_n \sum_{t=1}^T \alpha_t h_t(x_n)$, with $\alpha \in P_T$.
The margin of a set of examples is the minimum over the examples: $\rho(\alpha) := \min_n \rho_n(\alpha)$.
[Freund, Schapire, Bartlett & Lee, 1998]

23 Large Margin and Linear Separation (Large margins)
Input space $X$, feature space $F$, feature map $\Phi$. Linear separation in $F$ is nonlinear separation in $X$:
$\Phi(x) = (h_1(x), h_2(x), \dots)^\top$, with $H = \{h_1, h_2, \dots\}$.
[Mangasarian, 1999; G.R., Mika, Schölkopf & Müller, 2002]

24 Margin vs. edge (Large margins)
Margin: a measure of confidence in the prediction for a hypothesis weighting. Margin of example $n$ under the current weighting $\alpha$:
$\rho_n(\alpha) = y_n f_\alpha(x_n) = y_n \sum_{t=1}^T \alpha_t h_t(x_n)$, $\alpha \in P_T$.
Edge: a measure of the goodness of a hypothesis w.r.t. a distribution. Edge of a hypothesis $h$ for a distribution $d$ on the examples:
$\gamma_h(d) = \sum_{n=1}^N d_n y_n h(x_n)$, $d \in P_N$.
What is the connection? [Breiman, 1999]
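
One way to make the connection concrete (previewing the game-theoretic view of the next section): both quantities are weightings of the same $N \times t$ matrix $M_{n,q} = y_n h_q(x_n)$. A small NumPy sketch:

```python
import numpy as np

def game_matrix(y, H):
    """M[n, q] = y_n * h_q(x_n), where H[q, n] = h_q(x_n)."""
    return y[:, None] * H.T

def margin(M, alpha):
    """rho(alpha): minimum alpha-weighted row sum (margin of the sample)."""
    return float((M @ alpha).min())

def best_edge(M, d):
    """max_h gamma_h(d): maximum d-weighted column sum."""
    return float((d @ M).max())
```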


27 von Neumann's Minimax Theorem (Large margins)
Given a set of examples $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$ and a hypothesis set $H_t = \{h_1, \dots, h_t\}$:
minimum edge: $\gamma_t^* = \min_{d \in P_N} \max_{h \in H_t} \gamma_h(d)$ (the edge of $H_t$)
maximum margin: $\rho_t^* = \max_{\alpha \in P_t} \min_n y_n f_\alpha(x_n)$ (the margin of $S$)
Duality: $\gamma_t^* = \rho_t^*$. [von Neumann, 1928]

28 von Neumann's Minimax Theorem (Large margins)
The same statement with edge and margin written out:
$\gamma_t^* = \min_{d \in P_N} \max_{q = 1, \dots, t} \sum_{n=1}^N d_n y_n h_q(x_n)$
$\rho_t^* = \max_{\alpha \in P_t} \min_n \, y_n \sum_{q=1}^t \alpha_q h_q(x_n)$
Duality: $\gamma_t^* = \rho_t^*$.

29 Outline (section: Games)

30 Two-player Zero-Sum Game (Games)
Rock, Paper, Scissors: the rows {R, P, S} belong to the row player, the columns to the column player (gain matrix shown in the figure).
A single row is a pure strategy of the row player, and $d$ is a mixed strategy over rows; a single column is a pure strategy of the column player, and $\alpha$ is a mixed strategy over columns.
The row player minimizes and the column player maximizes the payoff $d^\top M \alpha = \sum_{i,j} d_i M_{i,j} \alpha_j$.
[Freund and Schapire, 1997]

31 Optimum Strategy (Games)
Min-max theorem (for the R, P, S gain matrix of the previous slide):
$\min_d \max_\alpha d^\top M \alpha = \min_d \max_j d^\top M e_j = \max_\alpha \min_d d^\top M \alpha = \max_\alpha \min_i e_i^\top M \alpha$ = value of the game ($0$ in the example),
where $e_j$ (resp. $e_i$) denotes a pure strategy.

32 Connection to Boosting? (Games)
Rows are the examples, columns the weak hypotheses: $M_{i,j} = y_i h_j(x_i)$.
$\alpha$-weighted row sum: margin of an example. $d$-weighted column sum: edge of a weak hypothesis.
Value of the game: $\gamma^* = \rho^*$.
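
A sketch that computes the value of this game with scipy's LP solver, so that $\gamma^* = \rho^*$ can be checked numerically on small gain matrices:

```python
import numpy as np
from scipy.optimize import linprog

def game_value(M):
    """Value of the zero-sum game with gain matrix M (rows: examples,
    columns: hypotheses): rho* = max_alpha min_n (M @ alpha)_n.
    By von Neumann duality this equals gamma* = min_d max_q (d @ M)_q."""
    N, T = M.shape
    # variables (alpha_1..alpha_T, rho); maximize rho <=> minimize -rho
    c = np.concatenate([np.zeros(T), [-1.0]])
    # constraints: rho - (M @ alpha)_n <= 0 for every example n
    A_ub = np.hstack([-M, np.ones((N, 1))])
    b_ub = np.zeros(N)
    A_eq = np.concatenate([np.ones(T), [0.0]])[None, :]   # sum alpha = 1
    bounds = [(0, None)] * T + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[-1]
```

For the standard Rock-Paper-Scissors matrix `np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])` the value is 0, matching the example above.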

33 New column added: boosting (Games)
Adding a column (a new weak hypothesis) to the gain matrix helps the column player: in the example the value of the game increases from 0 to 0.11. (Extended gain matrix shown in the figure.)

34 Row added: on-line learning (Games)
Adding a row (a new example) helps the row player: in the example the value of the game decreases from 0 to $-0.11$. (Extended gain matrix shown in the figure.)

35 Boosting: maximize the margin incrementally (Games)
(Figure: the gain matrix grows by one column per iteration; iterations 1, 2, 3 with weightings $\alpha^t$ and distributions $d^t$.)
In each iteration, solve an optimization problem to update $d$; the column player then adds a new column, i.e. a weak hypothesis.
Some assumption will be needed on the edge of the added hypothesis.

36 Value non-decreasing (Games; figure)
$\gamma^*, \rho^*$: edge/margin for all hypotheses.

37 Duality gap (Games)
For any non-optimal $d \in P_N$ and $\alpha \in P_t$:
$\gamma(d) \ge \gamma_t^* = \rho_t^* \ge \rho(\alpha)$,
where $\gamma(d) = \max_{h \in H_t} \gamma_h(d)$.


40 How Large is the Maximal Margin? (Games)
Assumption on the weak learner: for any distribution $d$ on the examples, the weak learner returns a hypothesis $h$ with edge $\gamma_h(d)$ at least $g$. Best case: $g = \rho^* = \gamma^*$.
Implication of the minimax theorem: there exists a hypothesis weighting $\alpha$ such that $\rho(\alpha) \ge g$.
Idea: iteratively solve the LP (LPBoost): add the best hypothesis $h = \operatorname{argmax}_h \gamma_h(d^t)$ to $H_t$ and re-solve
$d^{t+1} = \operatorname{argmin}_{d \in P_N} \max_{h \in H_t} \gamma_h(d)$.
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]
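
A column-generation sketch of this idea over a finite pool of candidate hypotheses (an illustration only, not the authors' implementation; it uses scipy's LP solver):

```python
import numpy as np
from scipy.optimize import linprog

def lpboost(M, max_iter=50, tol=1e-6):
    """LPBoost sketch. M[n, j] = y_n h_j(x_n) over a finite pool of
    candidate hypotheses. Each iteration adds the best-edge hypothesis,
    then re-solves d = argmin_{d in P_N} max_{chosen q} d . u^q."""
    N = M.shape[0]
    d = np.full(N, 1.0 / N)
    chosen = []
    for _ in range(max_iter):
        j = int(np.argmax(d @ M))               # weak learner: best edge
        if chosen and d @ M[:, j] <= max(d @ M[:, q] for q in chosen) + tol:
            break                               # no hypothesis beats the value
        chosen.append(j)
        U = M[:, chosen]                        # N x t matrix of u^q columns
        # variables (d_1..d_N, gamma): minimize gamma s.t. d.u^q <= gamma
        c = np.concatenate([np.zeros(N), [1.0]])
        A_ub = np.hstack([U.T, -np.ones((len(chosen), 1))])
        A_eq = np.concatenate([np.ones(N), [0.0]])[None, :]
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(chosen)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * N + [(None, None)], method="highs")
        d = res.x[:N]
    return d, chosen
```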

41 Convergence? (Games)
LPBoost? No iteration bounds known.
AdaBoost? May oscillate; does not find the margin-maximizing $\alpha$ (there are counterexamples). But there are some guarantees:
$\rho(\alpha^t) \ge 0$ after $2 \ln N / g^2$ iterations, and $\rho(\alpha^t) \ge g/2$ in the limit.
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]

42 How to maximize the margin? (Games)
Modifications of AdaBoost for maximizing the margin:
- Arc-GV asymptotically maximizes the margin; quite slow, no convergence rates.
- LPBoost uses a linear-programming solver; often very fast in practice, but no convergence rates.
- AdaBoost$_\nu$ requires $\frac{2 \ln N}{\nu^2}$ iterations to get $\rho_t \in [\rho^* - \nu, \rho^*]$; slow in practice, i.e. not faster than the theory predicts.
- TotalBoost$_\nu$ requires $\frac{2 \ln N}{\nu^2}$ iterations to get $\rho_t \in [\rho^* - \nu, \rho^*]$; fast in practice. A combination of the benefits.
[G.R. & Warmuth, 2004; Warmuth et al., 2006]


46 Outline (section: Convergence)


48 Convergence
Want margin $\ge g - \nu$.
Assumption: $\gamma_t \ge g$.
Estimate of the target: $\hat\gamma_t = (\min_{q = 1, \dots, t} \gamma_q) - \nu$.

49 Convergence
Idea: project onto edge constraints with bound $\hat\gamma_t$ instead of $0$.
Corrective (AdaBoost$_\nu$): a single constraint:
$\min_{d \in P_N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) \le \hat\gamma_t$.
Totally corrective (TotalBoost$_\nu$): one constraint per past weak hypothesis:
$\min_{d \in P_N} \Delta(d, d^1)$ s.t. $\sum_{n=1}^N d_n y_n h_q(x_n) \le \hat\gamma_t$ for $q = 1, \dots, t$.

50 TotalBoost$_\nu$ (Convergence)
1. Input: $S = \langle (x_1, y_1), \dots, (x_N, y_N) \rangle$, desired accuracy $\nu$.
2. Initialize: $d_n^1 = 1/N$ for all $n = 1, \dots, N$.
3. Do for $t = 1, \dots$:
   1. Train a classifier on $\{S, d^t\}$, obtaining a hypothesis $h_t \colon x \mapsto [-1, 1]$, and let $u_n^t = y_n h_t(x_n)$.
   2. Calculate the edge of $h_t$: $\gamma_t = d^t \cdot u^t$.
   3. Set $\hat\gamma_t = (\min_{q = 1, \dots, t} \gamma_q) - \nu$ and solve the optimization problem
      $d^{t+1} = \operatorname{argmin}_{d \in C_t} \Delta(d, d^1)$, where $C_t := \{d \in P_N : d \cdot u^q \le \hat\gamma_t \text{ for } 1 \le q \le t\}$.
   4. If the above is infeasible or $d^{t+1}$ contains a zero, then set $T = t$ and break.
4. Output: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$, where the coefficients $\alpha_t$ maximize the margin over the hypothesis set $\{h_1, \dots, h_T\}$.

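
A sketch of the projection in step 3.3 using a generic convex solver; the names and the SLSQP choice are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def totalboost_step(U, gamma_hat):
    """One projection of TotalBoost_nu: d^{t+1} = argmin over C_t of
    Delta(d, d^1) with d^1 uniform, C_t = {d in P_N : d.u^q <= gamma_hat}.
    U[q] is the vector u^q = (y_n h_q(x_n))_n for q = 1..t."""
    t, N = U.shape
    def rel_entropy(d):                      # Delta(d, d^1), d^1 = 1/N
        return float(np.sum(d * np.log(np.clip(d, 1e-12, None) * N)))
    cons = [{"type": "eq", "fun": lambda d: d.sum() - 1.0}]
    for q in range(t):                       # one constraint per hypothesis
        cons.append({"type": "ineq", "fun": lambda d, u=U[q]: gamma_hat - d @ u})
    res = minimize(rel_entropy, np.full(N, 1.0 / N), method="SLSQP",
                   bounds=[(0.0, 1.0)] * N, constraints=cons)
    return res.x if res.success else None    # None: treat as infeasible, break
```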

53 Effect of entropy regularization (Convergence)
High weight is given to hard examples: a softmin of the margins,
$d_n^{t+1} \propto \exp(-\underbrace{y_n f_{\alpha^{t+1}}(x_n)}_{\text{margin of ex. } n})$,
where $\alpha^{t+1}$ are the current coefficients of the hypotheses.
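
In code the weighting is just a softmin over the margins (a two-line sketch, shifting exponents for numerical stability):

```python
import numpy as np

def softmin_weights(rho):
    """d_n proportional to exp(-rho_n): small-margin (hard) examples
    receive high weight."""
    w = np.exp(-(rho - rho.min()))   # shift exponents for stability
    return w / w.sum()
```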

54 Lower Bounding the Progress (Convergence)
Theorem. For $d^t, d^{t+1} \in P_N$ and $u^t \in [-1, 1]^N$, if $\Delta(d^{t+1}, d^t)$ is finite and $d^{t+1} \cdot u^t \le d^t \cdot u^t$, then
$\Delta(d^{t+1}, d^t) > \frac{(d^{t+1} \cdot u^t - d^t \cdot u^t)^2}{2}$.
(A Pinsker-type bound: since $u^t \in [-1, 1]^N$, $|d^{t+1} \cdot u^t - d^t \cdot u^t| \le \|d^{t+1} - d^t\|_1$.)
Here $d^t \cdot u^t = \gamma_t$ and $d^{t+1} \cdot u^t \le \hat\gamma_t \le \gamma_t - \nu$.
Thus $d^t \cdot u^t - d^{t+1} \cdot u^t \ge \nu$ and $\Delta(d^{t+1}, d^t) > \frac{\nu^2}{2}$.

55 Generalized Pythagorean Theorem (Convergence)
$C_t = \{d \in P_N : d \cdot u^q \le \hat\gamma_t, \ 1 \le q \le t\}$, $C_0 = P_N$, $C_t \subseteq C_{t-1}$.
$d^t$ is the projection of $d^1$ onto $C_{t-1}$ at iteration $t-1$: $d^t = \operatorname{argmin}_{d \in C_{t-1}} \Delta(d, d^1)$.
Since $d^{t+1} \in C_t \subseteq C_{t-1}$:
$\Delta(d^{t+1}, d^1) \ge \Delta(d^t, d^1) + \Delta(d^{t+1}, d^t)$.
[Herbster & Warmuth, 2001]

56 Sketch of Proof (Convergence)
1: $\Delta(d^2, d^1) - \Delta(d^1, d^1) \ge \Delta(d^2, d^1) > \nu^2/2$
2: $\Delta(d^3, d^1) - \Delta(d^2, d^1) \ge \Delta(d^3, d^2) > \nu^2/2$
...
t: $\Delta(d^{t+1}, d^1) - \Delta(d^t, d^1) \ge \Delta(d^{t+1}, d^t) > \nu^2/2$
...
T-1: $\Delta(d^T, d^1) - \Delta(d^{T-1}, d^1) \ge \Delta(d^T, d^{T-1}) > \nu^2/2$
[Warmuth, Liao & G.R., 2006]

57 Cancellation (Convergence)
Summing the inequalities, the left-hand sides telescope: the $\Delta(d^1, d^1)$ term is $0$, and the remaining $\Delta(d^T, d^1)$ is at most $\ln N$. Hence
$\ln N \ge \Delta(d^T, d^1) > (T-1) \frac{\nu^2}{2}$,
and therefore $T \le \frac{2 \ln N}{\nu^2}$.

58 Overview (Convergence; figure) [Long & Wu, 2002]

59 Iteration Bounds for Other Variants (Convergence)
Using the same techniques, we can prove iteration bounds for:
- TotalBoost$_\nu$ which optimizes the divergence $\Delta(d, d^t)$ to the last distribution $d^t$;
- TotalBoost$_\nu$ which uses the binary relative entropy $\Delta_2(d, d^1)$ or $\Delta_2(d, d^t)$ as the divergence;
- the variant of AdaBoost$_\nu$ which terminates when $\gamma_t < \hat\gamma_t$.

60 TotalBoost$_\nu$ which optimizes $\Delta(d, d^t)$ (Convergence)
$d^{t+1}$ is the projection of $d^t$ onto $C_t$ at iteration $t$: $d^{t+1} = \operatorname{argmin}_{d \in C_t} \Delta(d, d^t)$.
$\Delta(d^T, d^t) \ge \Delta(d^T, d^{t+1}) + \Delta(d^{t+1}, d^t)$

61 Sketch of Proof (Convergence)
1: $\Delta(d^T, d^1) - \Delta(d^T, d^2) \ge \Delta(d^2, d^1) > \nu^2/2$
2: $\Delta(d^T, d^2) - \Delta(d^T, d^3) \ge \Delta(d^3, d^2) > \nu^2/2$
...
t: $\Delta(d^T, d^t) - \Delta(d^T, d^{t+1}) \ge \Delta(d^{t+1}, d^t) > \nu^2/2$
...
T-1: $\Delta(d^T, d^{T-1}) - \Delta(d^T, d^T) \ge \Delta(d^T, d^{T-1}) > \nu^2/2$

62 Cancellation (Convergence)
Again the left-hand sides telescope: the leading $\Delta(d^T, d^1)$ is at most $\ln N$ and the trailing $\Delta(d^T, d^T)$ is $0$. Hence $\ln N > (T-1) \frac{\nu^2}{2}$, and therefore $T \le \frac{2 \ln N}{\nu^2}$.

63 TotalBoost$_\nu$ which optimizes $\Delta(d, d^t)$ (Convergence; figure)

64 Outline (section: Illustrative Experiments) [Warmuth, Liao & G.R., 2006]

65 Illustrative Experiments
Cox-1 dataset from Telik Inc.: a relatively small drug-design data set with 125 binary-labeled examples and 3888 binary features.
We compare the convergence of the margin versus the number of iterations.
[Warmuth, Liao & G.R., 2006]

66 Illustrative Experiments
Cox-1 ($\nu = 0.01$): margin vs. iterations for AdaBoost$_\nu$, LPBoost, TotalBoost$_\nu$ (figure).
Results: the corrective algorithms are very slow; LPBoost and TotalBoost$_\nu$ need few iterations; the initial speed crucially depends on $\nu$.


72 LPBoost May Perform Much Worse Than TotalBoost (Illustrative Experiments)
We identified cases where LPBoost converges considerably more slowly than TotalBoost$_\nu$. The data is a series of artificial datasets of 1000 examples with a varying number of features, created as follows (a sketch of the construction follows below):
- First generate $N_1$ random $\pm 1$-valued features $x_1, \dots, x_{N_1}$ and set the label of each example to $y = \operatorname{sign}(x_1 + x_2 + x_3 + x_4 + x_5)$.
- Then duplicate each feature $N_2$ times, perturb the features by Gaussian noise with $\sigma = 0.1$, and clip the feature values to the interval $[-1, 1]$.
- Different $N_1, N_2$ are considered; the total number of features is $N_1 N_2$.
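
A NumPy sketch of this construction (defaults scaled down from the experiments' $N_1 N_2 = 100{,}000$ features):

```python
import numpy as np

def make_redundant_data(n_examples=1000, N1=100, N2=10, sigma=0.1, seed=0):
    """N1 random +/-1 features; label y = sign(x_1 + ... + x_5); each
    feature duplicated N2 times, perturbed by Gaussian noise and
    clipped to [-1, 1]. Total number of features: N1 * N2."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_examples, N1))
    y = np.sign(X[:, :5].sum(axis=1))    # sum of five +/-1 values is never 0
    X = np.repeat(X, N2, axis=1)         # N2 near-copies of each feature
    X = np.clip(X + sigma * rng.standard_normal(X.shape), -1.0, 1.0)
    return X, y
```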

73 LPBoost performs worse for high-dimensional data with many redundant features (Illustrative Experiments)
LPBoost vs. TotalBoost$_\nu$ on two 100,000-dimensional datasets (figure): [left] many redundant features ($N_1 = 1{,}000$, $N_2 = 100$) and [right] independent features ($N_1 = 100{,}000$, $N_2 = 1$). The plots show margin vs. number of iterations.

74 Bound Not True for LPBoost (Illustrative Experiments)
A counterexample (Table 4.1, in the figure: a toy dataset where each row is an example and each column a hypothesis; a correct prediction is displayed as +, an incorrect one as -).
TotalBoost averages hypotheses 1 and 6 (i.e., 2 iterations) to achieve the maximum margin 0.

75 Bound Not True for LPBoost (Illustrative Experiments)
Table 4.2 (in the figure) shows the first five iterations of LPBoost: the distribution on the examples and the hypothesis selected at each iteration.
Simplex-based LPBoost uses 5 iterations ($(N+1)/2$ iterations) to achieve margin 0.

76 Regularized LPBoost (Illustrative Experiments)
LPBoost makes the edge constraints as tight as possible; picking solutions at the corners can lead to slow convergence. Interior-point methods avoid the corners.
Regularized LPBoost: pick the solution that minimizes the relative entropy to the initial distribution. It is identical to TotalBoost$_\nu$, but the latter algorithm uses a higher edge bound.
Open problem: find an example where all versions of LPBoost need $O(N)$ iterations.

77 Algorithms for Feature Selection (Illustrative Experiments)
For each algorithm we test: whether the selected hypotheses are redundant, and the size of the final hypothesis.

78 Redundancy of Selected Hypotheses (Illustrative Experiments)
We test which algorithm selects redundant hypotheses. Dataset 2 was created by expanding Rudin's dataset: it has 100 blocks of features, where each block contains one original feature and 99 mutations of it. Each mutation inverts the feature value of one randomly chosen example (chosen with replacement). The mutation features are considered redundant hypotheses; a good algorithm would avoid repeatedly selecting hypotheses from the same block.
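
Dataset 2's block construction as a sketch (assuming $\pm 1$-valued base features; `X_orig` is a hypothetical placeholder for Rudin's original features):

```python
import numpy as np

def expand_to_blocks(X_orig, n_mutations=99, seed=0):
    """Each original feature becomes a block of 1 + n_mutations columns;
    every mutation flips the feature's value on one example chosen
    uniformly at random (with replacement across mutations)."""
    rng = np.random.default_rng(seed)
    n, f = X_orig.shape
    cols = []
    for j in range(f):
        base = X_orig[:, j]
        cols.append(base)
        for _ in range(n_mutations):
            mut = base.copy()
            i = rng.integers(n)
            mut[i] = -mut[i]              # invert one example's value
            cols.append(mut)
    return np.stack(cols, axis=1)         # blocks of redundant hypotheses
```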

79 Experiment on Redundancy of Selected Hypotheses (Illustrative Experiments)
The figure shows the number of selected blocks vs. the number of selected hypotheses.
Redundancy of the selected hypotheses: AdaBoost > TotalBoost$_\nu$ ($\nu = 0.01$) > LPBoost.
If $\rho^*$ is known, TotalBoost$_\nu^g$ selects one hypothesis per block (not shown).

80 Size of Final Hypothesis (Illustrative Experiments)
The figure shows margins vs. the number of hypotheses used (those with nonzero weight).
TotalBoost$_\nu$ ($\nu = 0.01$) and LPBoost use a small number of hypotheses in the final hypothesis.

81 Summary
- AdaBoost can be viewed as an entropy projection.
- TotalBoost projects based on all previous hypotheses and provably maximizes the margin. In theory it is as fast as AdaBoost$_\nu$; in practice it is much faster (comparable to LPBoost).
- Versatile techniques for proving iteration bounds.
- Experiments corroborate our theory; good for feature selection.
- LPBoost may have problems maximizing the margin.
- Future: extension to the soft-margin case.

82 Iteration Bound for a Variant of AdaBoost$_\nu$ (figure)


More information

Leaving The Span Manfred K. Warmuth and Vishy Vishwanathan

Leaving The Span Manfred K. Warmuth and Vishy Vishwanathan Leaving The Span Manfred K. Warmuth and Vishy Vishwanathan UCSC and NICTA Talk at NYAS Conference, 10-27-06 Thanks to Dima and Jun 1 Let s keep it simple Linear Regression 10 8 6 4 2 0 2 4 6 8 8 6 4 2

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity What makes good ensemble? CS789: Machine Learning and Neural Network Ensemble methods Jakramate Bootkrajang Department of Computer Science Chiang Mai University 1. A member of the ensemble is accurate.

More information

Barrier Boosting. Abstract. 1 Introduction. 2 Boosting and Convex Programming

Barrier Boosting. Abstract. 1 Introduction. 2 Boosting and Convex Programming Barrier Boosting Gunnar Rätsch, Manfred Warmuth, Sebastian Mika, Takashi Onoda Ý, Steven Lemm and Klaus-Robert Müller GMD FIRST, Kekuléstr. 7, 12489 Berlin, Germany Computer Science Department, UC Santa

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles. Léon Bottou COS 424 4/8/2010 Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

Boosting & Deep Learning

Boosting & Deep Learning Boosting & Deep Learning Ensemble Learning n So far learning methods that learn a single hypothesis, chosen form a hypothesis space that is used to make predictions n Ensemble learning à select a collection

More information

Zero-Sum Games Public Strategies Minimax Theorem and Nash Equilibria Appendix. Zero-Sum Games. Algorithmic Game Theory.

Zero-Sum Games Public Strategies Minimax Theorem and Nash Equilibria Appendix. Zero-Sum Games. Algorithmic Game Theory. Public Strategies Minimax Theorem and Nash Equilibria Appendix 2013 Public Strategies Minimax Theorem and Nash Equilibria Appendix Definition Definition A zero-sum game is a strategic game, in which for

More information