Totally Corrective Boosting Algorithms that Maximize the Margin


1 Totally Corrective Boosting Algorithms that Maximize the Margin
Manfred K. Warmuth¹, Jun Liao¹, Gunnar Rätsch²
¹ University of California, Santa Cruz; ² Friedrich Miescher Laboratory, Tübingen, Germany
Last update: March 2, 2007

2 Outline
1. Basics
2. Large margins
3. Games
4. Convergence
5. Illustrative Experiments

3 Outline (section: Basics)

4 Protocol of Boosting (Basics)
Maintain a distribution $d^t$ on the examples.
At iteration $t = 1, \dots, T$:
1. Receive a weak hypothesis $h_t$.
2. Update $d^t$ to $d^{t+1}$, putting more weight on hard examples.
Output a convex combination of the weak hypotheses: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$.
[Freund & Schapire, 1995]
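
A minimal sketch of this protocol in Python; `weak_learner` and `update` are hypothetical placeholders for a base learner that accepts a weighted sample and for a distribution-update rule:

```python
import numpy as np

def boost(X, y, weak_learner, update, T):
    """Generic boosting protocol: maintain a distribution d over the
    N examples, receive a weak hypothesis, reweight hard examples."""
    N = len(y)
    d = np.full(N, 1.0 / N)                # d^1: uniform
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, d)          # 1. receive weak hypothesis h_t
        d, a = update(d, y * h(X))         # 2. update d^t -> d^{t+1}
        hypotheses.append(h)
        alphas.append(a)
    # output: the convex combination f_alpha(x) = sum_t alpha_t h_t(x)
    return lambda Xq: sum(a * h(Xq) for a, h in zip(alphas, hypotheses))
```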


7 Boosting: 1st Iteration (Basics)
First hypothesis (values from the figure):
Error rate: $\epsilon_t = \sum_{n=1}^N d_n^t \, I(h_t(x_n) \neq y_n) = 2/11$
Edge: $\gamma_t = \sum_{n=1}^N d_n^t \, y_n h_t(x_n) = 9/22$
(In general, $\gamma_t = 1 - 2\epsilon_t$ for $\pm 1$-valued hypotheses.)
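
The two quantities in NumPy (a small sketch assuming labels and predictions in $\{-1, +1\}$):

```python
import numpy as np

def error_rate(d, y, pred):
    """Weighted error: sum of d_n over misclassified examples."""
    return float(np.sum(d * (pred != y)))

def edge(d, y, pred):
    """Edge gamma = sum_n d_n * y_n * h(x_n); equals 1 - 2*error
    when pred takes values in {-1, +1}."""
    return float(np.sum(d * y * pred))
```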


9 Update Distribution (Basics)
Misclassified examples get increased weights.
After the update: error rate $\epsilon(h_t, d^{t+1}) = 1/2$ and edge $\gamma(h_t, d^{t+1}) = 0$.


11 AdaBoost as Entropy Projection (Basics)
Minimize the relative entropy to the last distribution subject to a constraint:
$\min_{d \in P_N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) = 0$,
where $\Delta(d, d^t) = \sum_{n=1}^N d_n \ln \frac{d_n}{d_n^t}$ and $P_N$ is the $N$-dimensional probability simplex.
[Lafferty, 1999; Kivinen & Warmuth, 1999]
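
Solving this projection recovers AdaBoost's exponential update: by Lagrangian duality, $d_n^{t+1} \propto d_n^t e^{-\eta u_n}$ with $u_n = y_n h_t(x_n)$ and $\eta$ chosen so the constraint holds. A sketch that finds $\eta$ numerically (assuming $u$ has entries of both signs, so the constraint is feasible):

```python
import numpy as np
from scipy.optimize import brentq

def entropy_project(d, u):
    """Project d onto {d' in the simplex : d'.u = 0} in relative entropy.
    Duality gives d'_n proportional to d_n * exp(-eta * u_n), with eta
    chosen so the new edge is exactly zero (AdaBoost's update)."""
    def new_edge(eta):
        w = d * np.exp(-eta * u)
        w /= w.sum()
        return float(w @ u)
    # the edge is monotone decreasing in eta; a wide bracket suffices
    eta = brentq(new_edge, -50.0, 50.0)
    w = d * np.exp(-eta * u)
    return w / w.sum()
```

For $\pm 1$-valued hypotheses the root has a closed form, $\eta = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$, which is exactly AdaBoost's coefficient $\alpha_t$.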

12 Before 2nd Iteration (Basics; figure)

13 Boosting: 2nd Hypothesis (Basics)
AdaBoost assumption: edge $\gamma > \nu$.

14 Update Distribution (Basics)
Edge $\gamma = 0$: the AdaBoost update sets the edge of the last hypothesis to 0.
Number of iterations: $\frac{2 \ln N}{\nu^2}$.

15 Which constraints? (Basics)
Corrective (AdaBoost): a single constraint on the last hypothesis:
$\min_{d \in P_N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) = 0$.
Totally corrective: one constraint per past weak hypothesis:
$\min_{d \in P_N} \Delta(d, d^1)$ s.t. $\sum_{n=1}^N d_n y_n h_q(x_n) = 0$ for $q = 1, \dots, t$.
[Lafferty, 1999; Kivinen & Warmuth, 1999]


17 Boosting: 3rd Hypothesis (Basics; figure)

18 Boosting: 4th Hypothesis (Basics; figure)

19 All Hypotheses (Basics; figure)

20 Decision (Basics): $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x) > 0$?

21 Outline (section: Large margins)

22 Large margins in addition to correct classification
Margin of the combined hypothesis $f_\alpha$ for example $(x_n, y_n)$:
$\rho_n(\alpha) = y_n f_\alpha(x_n) = y_n \sum_{t=1}^T \alpha_t h_t(x_n)$, with $\alpha \in P_T$.
The margin of a set of examples is the minimum over the examples: $\rho(\alpha) := \min_n \rho_n(\alpha)$.
[Freund, Schapire, Bartlett & Lee, 1998]

23 Large Margin and Linear Separation (Large margins)
Input space $X$, feature space $F$, feature map $\Phi$. Linear separation in $F$ is nonlinear separation in $X$:
$\Phi(x) = (h_1(x), h_2(x), \dots)^\top$, with $H = \{h_1, h_2, \dots\}$.
[Mangasarian, 1999; G.R., Mika, Schölkopf & Müller, 2002]

24 Margin vs. edge (Large margins)
Margin: a measure of confidence in the prediction for a hypothesis weighting. Margin of example $n$ under the current weighting $\alpha$:
$\rho_n(\alpha) = y_n f_\alpha(x_n) = y_n \sum_{t=1}^T \alpha_t h_t(x_n)$, $\alpha \in P_T$.
Edge: a measure of the goodness of a hypothesis w.r.t. a distribution. Edge of a hypothesis $h$ for a distribution $d$ on the examples:
$\gamma_h(d) = \sum_{n=1}^N d_n y_n h(x_n)$, $d \in P_N$.
What is the connection? [Breiman, 1999]
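
One way to make the connection concrete (previewing the game-theoretic view of the next section): both quantities are weightings of the same $N \times t$ matrix $M_{n,q} = y_n h_q(x_n)$. A small NumPy sketch:

```python
import numpy as np

def game_matrix(y, H):
    """M[n, q] = y_n * h_q(x_n), where H[q, n] = h_q(x_n)."""
    return y[:, None] * H.T

def margin(M, alpha):
    """rho(alpha): minimum alpha-weighted row sum (margin of the sample)."""
    return float((M @ alpha).min())

def best_edge(M, d):
    """max_h gamma_h(d): maximum d-weighted column sum."""
    return float((d @ M).max())
```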


27 von Neumann's Minimax Theorem (Large margins)
Given a set of examples $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$ and a hypothesis set $H_t = \{h_1, \dots, h_t\}$:
minimum edge: $\gamma_t^* = \min_{d \in P_N} \max_{h \in H_t} \gamma_h(d)$ (the edge of $H_t$)
maximum margin: $\rho_t^* = \max_{\alpha \in P_t} \min_n y_n f_\alpha(x_n)$ (the margin of $S$)
Duality: $\gamma_t^* = \rho_t^*$. [von Neumann, 1928]

28 von Neumann's Minimax Theorem (Large margins)
The same statement with edge and margin written out:
$\gamma_t^* = \min_{d \in P_N} \max_{q = 1, \dots, t} \sum_{n=1}^N d_n y_n h_q(x_n)$
$\rho_t^* = \max_{\alpha \in P_t} \min_n \, y_n \sum_{q=1}^t \alpha_q h_q(x_n)$
Duality: $\gamma_t^* = \rho_t^*$.

29 Outline (section: Games)

30 Two-player Zero-Sum Game (Games)
Rock, Paper, Scissors: the rows {R, P, S} belong to the row player, the columns to the column player (gain matrix shown in the figure).
A single row is a pure strategy of the row player, and $d$ is a mixed strategy over rows; a single column is a pure strategy of the column player, and $\alpha$ is a mixed strategy over columns.
The row player minimizes and the column player maximizes the payoff $d^\top M \alpha = \sum_{i,j} d_i M_{i,j} \alpha_j$.
[Freund and Schapire, 1997]

31 Optimum Strategy (Games)
Min-max theorem (for the R, P, S gain matrix of the previous slide):
$\min_d \max_\alpha d^\top M \alpha = \min_d \max_j d^\top M e_j = \max_\alpha \min_d d^\top M \alpha = \max_\alpha \min_i e_i^\top M \alpha$ = value of the game ($0$ in the example),
where $e_j$ (resp. $e_i$) denotes a pure strategy.

32 Connection to Boosting? (Games)
Rows are the examples, columns the weak hypotheses: $M_{i,j} = y_i h_j(x_i)$.
$\alpha$-weighted row sum: margin of an example. $d$-weighted column sum: edge of a weak hypothesis.
Value of the game: $\gamma^* = \rho^*$.
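
A sketch that computes the value of this game with scipy's LP solver, so that $\gamma^* = \rho^*$ can be checked numerically on small gain matrices:

```python
import numpy as np
from scipy.optimize import linprog

def game_value(M):
    """Value of the zero-sum game with gain matrix M (rows: examples,
    columns: hypotheses): rho* = max_alpha min_n (M @ alpha)_n.
    By von Neumann duality this equals gamma* = min_d max_q (d @ M)_q."""
    N, T = M.shape
    # variables (alpha_1..alpha_T, rho); maximize rho <=> minimize -rho
    c = np.concatenate([np.zeros(T), [-1.0]])
    # constraints: rho - (M @ alpha)_n <= 0 for every example n
    A_ub = np.hstack([-M, np.ones((N, 1))])
    b_ub = np.zeros(N)
    A_eq = np.concatenate([np.ones(T), [0.0]])[None, :]   # sum alpha = 1
    bounds = [(0, None)] * T + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds, method="highs")
    return res.x[-1]
```

For the standard Rock-Paper-Scissors matrix `np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])` the value is 0, matching the example above.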

33 New column added: boosting (Games)
Adding a column (a new weak hypothesis) to the gain matrix helps the column player: in the example the value of the game increases from 0 to 0.11. (Extended gain matrix shown in the figure.)

34 Row added: on-line learning (Games)
Adding a row (a new example) helps the row player: in the example the value of the game decreases from 0 to $-0.11$. (Extended gain matrix shown in the figure.)

35 Boosting: maximize the margin incrementally (Games)
(Figure: the gain matrix grows by one column per iteration; iterations 1, 2, 3 with weightings $\alpha^t$ and distributions $d^t$.)
In each iteration, solve an optimization problem to update $d$; the column player then adds a new column, i.e. a weak hypothesis.
Some assumption will be needed on the edge of the added hypothesis.

36 Value non-decreasing (Games; figure)
$\gamma^*, \rho^*$: edge/margin for all hypotheses.

37 Duality gap (Games)
For any non-optimal $d \in P_N$ and $\alpha \in P_t$:
$\gamma(d) \ge \gamma_t^* = \rho_t^* \ge \rho(\alpha)$,
where $\gamma(d) = \max_{h \in H_t} \gamma_h(d)$.


40 How Large is the Maximal Margin? (Games)
Assumption on the weak learner: for any distribution $d$ on the examples, the weak learner returns a hypothesis $h$ with edge $\gamma_h(d)$ at least $g$. Best case: $g = \rho^* = \gamma^*$.
Implication of the minimax theorem: there exists a hypothesis weighting $\alpha$ such that $\rho(\alpha) \ge g$.
Idea: iteratively solve the LP (LPBoost): add the best hypothesis $h = \operatorname{argmax}_h \gamma_h(d^t)$ to $H_t$ and re-solve
$d^{t+1} = \operatorname{argmin}_{d \in P_N} \max_{h \in H_t} \gamma_h(d)$.
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]
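
A column-generation sketch of this idea over a finite pool of candidate hypotheses (an illustration only, not the authors' implementation; it uses scipy's LP solver):

```python
import numpy as np
from scipy.optimize import linprog

def lpboost(M, max_iter=50, tol=1e-6):
    """LPBoost sketch. M[n, j] = y_n h_j(x_n) over a finite pool of
    candidate hypotheses. Each iteration adds the best-edge hypothesis,
    then re-solves d = argmin_{d in P_N} max_{chosen q} d . u^q."""
    N = M.shape[0]
    d = np.full(N, 1.0 / N)
    chosen = []
    for _ in range(max_iter):
        j = int(np.argmax(d @ M))               # weak learner: best edge
        if chosen and d @ M[:, j] <= max(d @ M[:, q] for q in chosen) + tol:
            break                               # no hypothesis beats the value
        chosen.append(j)
        U = M[:, chosen]                        # N x t matrix of u^q columns
        # variables (d_1..d_N, gamma): minimize gamma s.t. d.u^q <= gamma
        c = np.concatenate([np.zeros(N), [1.0]])
        A_ub = np.hstack([U.T, -np.ones((len(chosen), 1))])
        A_eq = np.concatenate([np.ones(N), [0.0]])[None, :]
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(chosen)),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * N + [(None, None)], method="highs")
        d = res.x[:N]
    return d, chosen
```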

41 Convergence? (Games)
LPBoost? No iteration bounds known.
AdaBoost? May oscillate; does not find the margin-maximizing $\alpha$ (there are counterexamples). But there are some guarantees:
$\rho(\alpha^t) \ge 0$ after $2 \ln N / g^2$ iterations, and $\rho(\alpha^t) \ge g/2$ in the limit.
[Breiman, 1999; Bennett et al.; G.R. et al., 2001; Rudin et al., 2004]

42 How to maximize the margin? (Games)
Modifications of AdaBoost for maximizing the margin:
- Arc-GV asymptotically maximizes the margin; quite slow, no convergence rates.
- LPBoost uses a linear-programming solver; often very fast in practice, but no convergence rates.
- AdaBoost$_\nu$ requires $\frac{2 \ln N}{\nu^2}$ iterations to get $\rho_t \in [\rho^* - \nu, \rho^*]$; slow in practice, i.e. not faster than the theory predicts.
- TotalBoost$_\nu$ requires $\frac{2 \ln N}{\nu^2}$ iterations to get $\rho_t \in [\rho^* - \nu, \rho^*]$; fast in practice. A combination of the benefits.
[G.R. & Warmuth, 2004; Warmuth et al., 2006]


46 Outline (section: Convergence)


48 Convergence
Want margin $\ge g - \nu$.
Assumption: $\gamma_t \ge g$.
Estimate of the target: $\hat\gamma_t = (\min_{q = 1, \dots, t} \gamma_q) - \nu$.

49 Convergence
Idea: project onto edge constraints with bound $\hat\gamma_t$ instead of $0$.
Corrective (AdaBoost$_\nu$): a single constraint:
$\min_{d \in P_N} \Delta(d, d^t)$ s.t. $\sum_{n=1}^N d_n y_n h_t(x_n) \le \hat\gamma_t$.
Totally corrective (TotalBoost$_\nu$): one constraint per past weak hypothesis:
$\min_{d \in P_N} \Delta(d, d^1)$ s.t. $\sum_{n=1}^N d_n y_n h_q(x_n) \le \hat\gamma_t$ for $q = 1, \dots, t$.

50 TotalBoost$_\nu$ (Convergence)
1. Input: $S = \langle (x_1, y_1), \dots, (x_N, y_N) \rangle$, desired accuracy $\nu$.
2. Initialize: $d_n^1 = 1/N$ for all $n = 1, \dots, N$.
3. Do for $t = 1, \dots$:
   1. Train a classifier on $\{S, d^t\}$, obtaining a hypothesis $h_t \colon x \mapsto [-1, 1]$, and let $u_n^t = y_n h_t(x_n)$.
   2. Calculate the edge of $h_t$: $\gamma_t = d^t \cdot u^t$.
   3. Set $\hat\gamma_t = (\min_{q = 1, \dots, t} \gamma_q) - \nu$ and solve the optimization problem
      $d^{t+1} = \operatorname{argmin}_{d \in C_t} \Delta(d, d^1)$, where $C_t := \{d \in P_N : d \cdot u^q \le \hat\gamma_t \text{ for } 1 \le q \le t\}$.
   4. If the above is infeasible or $d^{t+1}$ contains a zero, then set $T = t$ and break.
4. Output: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$, where the coefficients $\alpha_t$ maximize the margin over the hypothesis set $\{h_1, \dots, h_T\}$.

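
A sketch of the projection in step 3.3 using a generic convex solver; the names and the SLSQP choice are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def totalboost_step(U, gamma_hat):
    """One projection of TotalBoost_nu: d^{t+1} = argmin over C_t of
    Delta(d, d^1) with d^1 uniform, C_t = {d in P_N : d.u^q <= gamma_hat}.
    U[q] is the vector u^q = (y_n h_q(x_n))_n for q = 1..t."""
    t, N = U.shape
    def rel_entropy(d):                      # Delta(d, d^1), d^1 = 1/N
        return float(np.sum(d * np.log(np.clip(d, 1e-12, None) * N)))
    cons = [{"type": "eq", "fun": lambda d: d.sum() - 1.0}]
    for q in range(t):                       # one constraint per hypothesis
        cons.append({"type": "ineq", "fun": lambda d, u=U[q]: gamma_hat - d @ u})
    res = minimize(rel_entropy, np.full(N, 1.0 / N), method="SLSQP",
                   bounds=[(0.0, 1.0)] * N, constraints=cons)
    return res.x if res.success else None    # None: treat as infeasible, break
```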

53 Effect of entropy regularization (Convergence)
High weight is given to hard examples: a softmin of the margins,
$d_n^{t+1} \propto \exp(-\underbrace{y_n f_{\alpha^{t+1}}(x_n)}_{\text{margin of ex. } n})$,
where $\alpha^{t+1}$ are the current coefficients of the hypotheses.
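
In code the weighting is just a softmin over the margins (a two-line sketch, shifting exponents for numerical stability):

```python
import numpy as np

def softmin_weights(rho):
    """d_n proportional to exp(-rho_n): small-margin (hard) examples
    receive high weight."""
    w = np.exp(-(rho - rho.min()))   # shift exponents for stability
    return w / w.sum()
```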

54 Lower Bounding the Progress (Convergence)
Theorem. For $d^t, d^{t+1} \in P_N$ and $u^t \in [-1, 1]^N$, if $\Delta(d^{t+1}, d^t)$ is finite and $d^{t+1} \cdot u^t \le d^t \cdot u^t$, then
$\Delta(d^{t+1}, d^t) > \frac{(d^{t+1} \cdot u^t - d^t \cdot u^t)^2}{2}$.
(A Pinsker-type bound: since $u^t \in [-1, 1]^N$, $|d^{t+1} \cdot u^t - d^t \cdot u^t| \le \|d^{t+1} - d^t\|_1$.)
Here $d^t \cdot u^t = \gamma_t$ and $d^{t+1} \cdot u^t \le \hat\gamma_t \le \gamma_t - \nu$.
Thus $d^t \cdot u^t - d^{t+1} \cdot u^t \ge \nu$ and $\Delta(d^{t+1}, d^t) > \frac{\nu^2}{2}$.

55 Generalized Pythagorean Theorem (Convergence)
$C_t = \{d \in P_N : d \cdot u^q \le \hat\gamma_t, \ 1 \le q \le t\}$, $C_0 = P_N$, $C_t \subseteq C_{t-1}$.
$d^t$ is the projection of $d^1$ onto $C_{t-1}$ at iteration $t-1$: $d^t = \operatorname{argmin}_{d \in C_{t-1}} \Delta(d, d^1)$.
Since $d^{t+1} \in C_t \subseteq C_{t-1}$:
$\Delta(d^{t+1}, d^1) \ge \Delta(d^t, d^1) + \Delta(d^{t+1}, d^t)$.
[Herbster & Warmuth, 2001]

56 Sketch of Proof (Convergence)
1: $\Delta(d^2, d^1) - \Delta(d^1, d^1) \ge \Delta(d^2, d^1) > \nu^2/2$
2: $\Delta(d^3, d^1) - \Delta(d^2, d^1) \ge \Delta(d^3, d^2) > \nu^2/2$
...
t: $\Delta(d^{t+1}, d^1) - \Delta(d^t, d^1) \ge \Delta(d^{t+1}, d^t) > \nu^2/2$
...
T-1: $\Delta(d^T, d^1) - \Delta(d^{T-1}, d^1) \ge \Delta(d^T, d^{T-1}) > \nu^2/2$
[Warmuth, Liao & G.R., 2006]

57 Cancellation (Convergence)
Summing the inequalities, the left-hand sides telescope: the $\Delta(d^1, d^1)$ term is $0$, and the remaining $\Delta(d^T, d^1)$ is at most $\ln N$. Hence
$\ln N \ge \Delta(d^T, d^1) > (T-1) \frac{\nu^2}{2}$,
and therefore $T \le \frac{2 \ln N}{\nu^2}$.

58 Overview (Convergence; figure) [Long & Wu, 2002]

59 Iteration Bounds for Other Variants (Convergence)
Using the same techniques, we can prove iteration bounds for:
- TotalBoost$_\nu$ which optimizes the divergence $\Delta(d, d^t)$ to the last distribution $d^t$;
- TotalBoost$_\nu$ which uses the binary relative entropy $\Delta_2(d, d^1)$ or $\Delta_2(d, d^t)$ as the divergence;
- the variant of AdaBoost$_\nu$ which terminates when $\gamma_t < \hat\gamma_t$.

60 TotalBoost$_\nu$ which optimizes $\Delta(d, d^t)$ (Convergence)
$d^{t+1}$ is the projection of $d^t$ onto $C_t$ at iteration $t$: $d^{t+1} = \operatorname{argmin}_{d \in C_t} \Delta(d, d^t)$.
$\Delta(d^T, d^t) \ge \Delta(d^T, d^{t+1}) + \Delta(d^{t+1}, d^t)$

61 Sketch of Proof (Convergence)
1: $\Delta(d^T, d^1) - \Delta(d^T, d^2) \ge \Delta(d^2, d^1) > \nu^2/2$
2: $\Delta(d^T, d^2) - \Delta(d^T, d^3) \ge \Delta(d^3, d^2) > \nu^2/2$
...
t: $\Delta(d^T, d^t) - \Delta(d^T, d^{t+1}) \ge \Delta(d^{t+1}, d^t) > \nu^2/2$
...
T-1: $\Delta(d^T, d^{T-1}) - \Delta(d^T, d^T) \ge \Delta(d^T, d^{T-1}) > \nu^2/2$

62 Cancellation (Convergence)
Again the left-hand sides telescope: the leading $\Delta(d^T, d^1)$ is at most $\ln N$ and the trailing $\Delta(d^T, d^T)$ is $0$. Hence $\ln N > (T-1) \frac{\nu^2}{2}$, and therefore $T \le \frac{2 \ln N}{\nu^2}$.

63 TotalBoost$_\nu$ which optimizes $\Delta(d, d^t)$ (Convergence; figure)

64 Outline (section: Illustrative Experiments) [Warmuth, Liao & G.R., 2006]

65 Illustrative Experiments
Cox-1 dataset from Telik Inc.: a relatively small drug-design data set with 125 binary-labeled examples and 3888 binary features.
We compare the convergence of the margin versus the number of iterations.
[Warmuth, Liao & G.R., 2006]

66 Illustrative Experiments
Cox-1 ($\nu = 0.01$): margin vs. iterations for AdaBoost$_\nu$, LPBoost, TotalBoost$_\nu$ (figure).
Results: the corrective algorithms are very slow; LPBoost and TotalBoost$_\nu$ need few iterations; the initial speed crucially depends on $\nu$.


72 LPBoost May Perform Much Worse Than TotalBoost (Illustrative Experiments)
We identified cases where LPBoost converges considerably more slowly than TotalBoost$_\nu$. The data is a series of artificial datasets of 1000 examples with a varying number of features, created as follows (a sketch of the construction follows below):
- First generate $N_1$ random $\pm 1$-valued features $x_1, \dots, x_{N_1}$ and set the label of each example to $y = \operatorname{sign}(x_1 + x_2 + x_3 + x_4 + x_5)$.
- Then duplicate each feature $N_2$ times, perturb the features by Gaussian noise with $\sigma = 0.1$, and clip the feature values to the interval $[-1, 1]$.
- Different $N_1, N_2$ are considered; the total number of features is $N_1 N_2$.
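
A NumPy sketch of this construction (defaults scaled down from the experiments' $N_1 N_2 = 100{,}000$ features):

```python
import numpy as np

def make_redundant_data(n_examples=1000, N1=100, N2=10, sigma=0.1, seed=0):
    """N1 random +/-1 features; label y = sign(x_1 + ... + x_5); each
    feature duplicated N2 times, perturbed by Gaussian noise and
    clipped to [-1, 1]. Total number of features: N1 * N2."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_examples, N1))
    y = np.sign(X[:, :5].sum(axis=1))    # sum of five +/-1 values is never 0
    X = np.repeat(X, N2, axis=1)         # N2 near-copies of each feature
    X = np.clip(X + sigma * rng.standard_normal(X.shape), -1.0, 1.0)
    return X, y
```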

73 LPBoost performs worse for high-dimensional data with many redundant features (Illustrative Experiments)
LPBoost vs. TotalBoost$_\nu$ on two 100,000-dimensional datasets (figure): [left] many redundant features ($N_1 = 1{,}000$, $N_2 = 100$) and [right] independent features ($N_1 = 100{,}000$, $N_2 = 1$). The plots show margin vs. number of iterations.

74 Bound Not True for LPBoost (Illustrative Experiments)
A counterexample (Table 4.1, in the figure: a toy dataset where each row is an example and each column a hypothesis; a correct prediction is displayed as +, an incorrect one as -).
TotalBoost averages hypotheses 1 and 6 (i.e., 2 iterations) to achieve the maximum margin 0.

75 Bound Not True for LPBoost (Illustrative Experiments)
Table 4.2 (in the figure) shows the first five iterations of LPBoost: the distribution on the examples and the hypothesis selected at each iteration.
Simplex-based LPBoost uses 5 iterations ($(N+1)/2$ iterations) to achieve margin 0.

76 Regularized LPBoost (Illustrative Experiments)
LPBoost makes the edge constraints as tight as possible; picking solutions at the corners can lead to slow convergence. Interior-point methods avoid the corners.
Regularized LPBoost: pick the solution that minimizes the relative entropy to the initial distribution. It is identical to TotalBoost$_\nu$, but the latter algorithm uses a higher edge bound.
Open problem: find an example where all versions of LPBoost need $O(N)$ iterations.

77 Algorithms for Feature Selection (Illustrative Experiments)
For each algorithm we test: whether the selected hypotheses are redundant, and the size of the final hypothesis.

78 Redundancy of Selected Hypotheses (Illustrative Experiments)
We test which algorithm selects redundant hypotheses. Dataset 2 was created by expanding Rudin's dataset: it has 100 blocks of features, where each block contains one original feature and 99 mutations of it. Each mutation inverts the feature value of one randomly chosen example (chosen with replacement). The mutation features are considered redundant hypotheses; a good algorithm would avoid repeatedly selecting hypotheses from the same block.
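
Dataset 2's block construction as a sketch (assuming $\pm 1$-valued base features; `X_orig` is a hypothetical placeholder for Rudin's original features):

```python
import numpy as np

def expand_to_blocks(X_orig, n_mutations=99, seed=0):
    """Each original feature becomes a block of 1 + n_mutations columns;
    every mutation flips the feature's value on one example chosen
    uniformly at random (with replacement across mutations)."""
    rng = np.random.default_rng(seed)
    n, f = X_orig.shape
    cols = []
    for j in range(f):
        base = X_orig[:, j]
        cols.append(base)
        for _ in range(n_mutations):
            mut = base.copy()
            i = rng.integers(n)
            mut[i] = -mut[i]              # invert one example's value
            cols.append(mut)
    return np.stack(cols, axis=1)         # blocks of redundant hypotheses
```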

79 Experiment on Redundancy of Selected Hypotheses (Illustrative Experiments)
The figure shows the number of selected blocks vs. the number of selected hypotheses.
Redundancy of the selected hypotheses: AdaBoost > TotalBoost$_\nu$ ($\nu = 0.01$) > LPBoost.
If $\rho^*$ is known, TotalBoost$_\nu^g$ selects one hypothesis per block (not shown).

80 Size of Final Hypothesis (Illustrative Experiments)
The figure shows margins vs. the number of hypotheses used (those with nonzero weight).
TotalBoost$_\nu$ ($\nu = 0.01$) and LPBoost use a small number of hypotheses in the final hypothesis.

81 Summary
- AdaBoost can be viewed as an entropy projection.
- TotalBoost projects based on all previous hypotheses and provably maximizes the margin. In theory it is as fast as AdaBoost$_\nu$; in practice it is much faster (comparable to LPBoost).
- Versatile techniques for proving iteration bounds.
- Experiments corroborate our theory; good for feature selection.
- LPBoost may have problems maximizing the margin.
- Future: extension to the soft-margin case.

82 Iteration Bound for a Variant of AdaBoost$_\nu$ (figure)


More information

Leaving The Span Manfred K. Warmuth and Vishy Vishwanathan

Leaving The Span Manfred K. Warmuth and Vishy Vishwanathan Leaving The Span Manfred K. Warmuth and Vishy Vishwanathan UCSC and NICTA Talk at NYAS Conference, 10-27-06 Thanks to Dima and Jun 1 Let s keep it simple Linear Regression 10 8 6 4 2 0 2 4 6 8 8 6 4 2

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity What makes good ensemble? CS789: Machine Learning and Neural Network Ensemble methods Jakramate Bootkrajang Department of Computer Science Chiang Mai University 1. A member of the ensemble is accurate.

More information

Barrier Boosting. Abstract. 1 Introduction. 2 Boosting and Convex Programming

Barrier Boosting. Abstract. 1 Introduction. 2 Boosting and Convex Programming Barrier Boosting Gunnar Rätsch, Manfred Warmuth, Sebastian Mika, Takashi Onoda Ý, Steven Lemm and Klaus-Robert Müller GMD FIRST, Kekuléstr. 7, 12489 Berlin, Germany Computer Science Department, UC Santa

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles. Léon Bottou COS 424 4/8/2010 Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

Boosting & Deep Learning

Boosting & Deep Learning Boosting & Deep Learning Ensemble Learning n So far learning methods that learn a single hypothesis, chosen form a hypothesis space that is used to make predictions n Ensemble learning à select a collection

More information

Zero-Sum Games Public Strategies Minimax Theorem and Nash Equilibria Appendix. Zero-Sum Games. Algorithmic Game Theory.

Zero-Sum Games Public Strategies Minimax Theorem and Nash Equilibria Appendix. Zero-Sum Games. Algorithmic Game Theory. Public Strategies Minimax Theorem and Nash Equilibria Appendix 2013 Public Strategies Minimax Theorem and Nash Equilibria Appendix Definition Definition A zero-sum game is a strategic game, in which for

More information