Manfred K. Warmuth (UCSC) and S.V.N. Vishwanathan (Purdue & Microsoft Research). Updated: March 23, 2010. ICML 09 Boosting Tutorial, 62 slides.
1 ICML 2009 Tutorial: Survey of Boosting from an Optimization Perspective
Part I: Entropy Regularized LPBoost
Part II: Boosting from an Optimization Perspective
Manfred K. Warmuth (UCSC), S.V.N. Vishwanathan (Purdue & Microsoft Research)
Updated: March 23, 2010
2 Outline
1. Introduction to Boosting
2. What is Boosting?
3. Entropy Regularized LPBoost
4. Overview of Boosting algorithms
5. Conclusion and Open Problems
3 Outline (section divider): Introduction to Boosting
4 Setup for Boosting [giants of the field: Schapire, Freund]. Examples: 11 apples, labeled +1 if artificial and -1 if natural. Goal: classification.
5 Setup for Boosting: +1/-1 examples with weights d_n (shown by size); the data is separable. [figure: feature 1 vs feature 2]
6 Weak hypotheses: decision stumps on the two features. A single stump can't do it; the goal is to find a convex combination of weak hypotheses that classifies all examples. [figure: feature 1 vs feature 2]
7 Boosting: 1st iteration. First hypothesis: error 1/11, edge 9/11; low error = high edge, since edge = 1 - 2 · error. [figure: feature 1 vs feature 2]
8 Update after 1st hypothesis: misclassified examples get increased weights, so after the update the edge of the hypothesis has decreased. [figure: feature 1 vs feature 2]
9 Before 2nd iteration: hard examples have high weight. [figure: feature 1 vs feature 2]
10 Boosting: 2nd hypothesis. Pick hypotheses with high (weighted) edge. [figure: feature 1 vs feature 2]
11 Update after 2nd: after the update, the edges of all past hypotheses should be small. [figure: feature 1 vs feature 2]
12 3rd hypothesis [figure: feature 1 vs feature 2]
13 Update after 3rd [figure: feature 1 vs feature 2]
14 4th hypothesis [figure: feature 1 vs feature 2]
15 Update after 4th [figure: feature 1 vs feature 2]
16 Final convex combination of all hypotheses. Decision: Σ_{t=1}^T w_t h_t(x) ≥ 0? [figure: feature 1 vs feature 2, positive total weight vs negative total weight]
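The decision rule on this slide is just a weighted vote. A minimal pure-Python sketch (the stumps, weights, and example point are illustrative, not from the tutorial):

```python
def combined_predict(hypotheses, weights, x):
    """Boosted classifier: sign of the weighted vote sum_t w_t * h_t(x)."""
    total = sum(w * h(x) for w, h in zip(weights, hypotheses))
    return 1 if total >= 0 else -1

# Two toy decision stumps on a 2-feature example x = (feature1, feature2).
stump1 = lambda x: 1 if x[0] > 0.5 else -1   # thresholds feature 1
stump2 = lambda x: 1 if x[1] > 0.5 else -1   # thresholds feature 2
# Vote: 0.7 * (+1) + 0.3 * (-1) = 0.4 >= 0, so the prediction is +1.
pred = combined_predict([stump1, stump2], [0.7, 0.3], (0.9, 0.1))
```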
17 Protocol of Boosting [FS97]
Maintain a distribution on the N ±1-labeled examples.
At iteration t = 1, ..., T:
- receive a weak hypothesis h_t of high edge
- update d^{t-1} to d^t: more weight on hard examples
Output a convex combination of the weak hypotheses: Σ_{t=1}^T w_t h_t(x).
Two sets of weights: a distribution d on the examples and a distribution w on the hypotheses.
18 Data representation: examples x_n with labels y_n; the accuracy entries are u_n^t := y_n h_t(x_n), which is +1 if h_t is perfect on x_n, -1 if opposite, and in between if neutral.
19 Edge vs. margin [Br99]
Edge of a hypothesis h_t for a distribution d on the examples: Σ_{n=1}^N u_n^t d_n (the weighted accuracy of the hypothesis; d ∈ P^N).
Margin of example n for the current hypothesis weighting w: Σ_{t=1}^T u_n^t w_t (the weighted accuracy of the example; w ∈ P^T).
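In code, the two quantities are just two different weighted sums over the same accuracy matrix u_n^t = y_n h_t(x_n). A small sketch with made-up numbers (mine, not the tutorial's):

```python
def edge(u_t, d):
    """Edge of hypothesis t for example distribution d: sum_n d_n * u_n^t."""
    return sum(dn * un for dn, un in zip(d, u_t))

def margin(U, w, n):
    """Margin of example n for hypothesis weighting w: sum_t w_t * u_n^t."""
    return sum(wt * U[t][n] for t, wt in enumerate(w))

# U[t][n] = y_n * h_t(x_n) in {-1, +1}: 2 hypotheses, 3 examples.
U = [[1, 1, -1],
     [1, -1, 1]]
d = [1/3, 1/3, 1/3]   # distribution on the examples
w = [0.5, 0.5]        # distribution on the hypotheses
e0 = edge(U[0], d)    # edge of hypothesis 0: (1 + 1 - 1) / 3 = 1/3
m0 = margin(U, w, 0)  # margin of example 0: 0.5 * 1 + 0.5 * 1 = 1.0
```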
21 AdaBoost
Initialize t = 0 and d_n^0 = 1/N.
For t = 1, ..., T:
- get h_t whose edge w.r.t. the current distribution is 1 - 2ε_t
- set w_t = (1/2) ln((1 - ε_t)/ε_t)
- update the distribution: d_n^t = d_n^{t-1} exp(-w_t u_n^t) / Σ_n d_n^{t-1} exp(-w_t u_n^t)
Final hypothesis: sgn(Σ_{t=1}^T w_t h_t(x))
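The pseudocode above translates almost line for line into Python. The sketch below uses a made-up 1-D toy set and a fixed pool of decision stumps as the oracle (all names and data are mine, not the tutorial's); on separable data like this, AdaBoost's exponential training-error bound drives the combined vote to zero training errors:

```python
import math

def adaboost(X, y, stumps, T):
    """AdaBoost as on the slide: d^0 uniform; each round take the stump with
    the largest edge (smallest weighted error eps_t), set
    w_t = 0.5 * ln((1 - eps_t) / eps_t), and reweight the examples."""
    N = len(X)
    d = [1.0 / N] * N
    ensemble = []
    for _ in range(T):
        # oracle: stump with maximum edge sum_n d_n * y_n * h(x_n)
        h = max(stumps, key=lambda s: sum(dn * yn * s(xn)
                                          for dn, xn, yn in zip(d, X, y)))
        eps = sum(dn for dn, xn, yn in zip(d, X, y) if h(xn) != yn)
        w = 0.5 * math.log((1 - eps) / eps)   # eps stays in (0, 1/2) here
        ensemble.append((w, h))
        d = [dn * math.exp(-w * yn * h(xn)) for dn, xn, yn in zip(d, X, y)]
        Z = sum(d)                            # normalizer
        d = [dn / Z for dn in d]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(w * h(x) for w, h in ensemble) >= 0 else -1

# Toy 1-D data: positives form the interval (0.3, 0.7). No single stump is
# perfect, but a convex combination of stumps separates the data.
X = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
y = [-1, -1, 1, 1, -1, -1]
pos = [lambda x, t=t: (1 if x > t else -1) for t in (0.3, 0.7)]
stumps = pos + [lambda x, h=h: -h(x) for h in pos] + [lambda x: -1]
ens = adaboost(X, y, stumps, T=120)
train_errors = sum(predict(ens, xn) != yn for xn, yn in zip(X, y))
```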
22 Objectives
Edge: the edges of past hypotheses should be small after the update, so minimize the maximum edge of the past hypotheses.
Margin: choose the convex combination of weak hypotheses that maximizes the minimum margin.
Which margin? SVM: 2-norm (weights on examples); Boosting: 1-norm (weights on base hypotheses).
Connection between the objectives? [figure: feature 1 vs feature 2]
23 Edge vs. margin: min max edge = max min margin
min_{d ∈ S^N} max_{q=1,...,t-1} u^q · d (edge of hypothesis q) = max_{w ∈ S^{t-1}} min_{n=1,...,N} Σ_{q=1}^{t-1} u_n^q w_q (margin of example n)
by Linear Programming duality.
24 Boosting as a zero-sum game [FS97]
Rock, Paper, Scissors game: the row player plays R, P, S with mixed strategy d; the column player plays R, P, S with mixed strategy w; U is the gain matrix. A single row is a pure strategy of the row player and d is a mixed strategy; a single column is a pure strategy of the column player and w is a mixed strategy. The row player minimizes and the column player maximizes the payoff d^T U w = Σ_{i,j} d_i U_{i,j} w_j.
25 Optimum strategy. Min-max theorem (e_j denotes a pure strategy):
min_d max_w d^T U w = min_d max_j d^T U e_j = max_w min_d d^T U w = max_w min_i e_i^T U w = value of the game (0 in the example).
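For the Rock-Paper-Scissors example, the claimed value of 0 is easy to check numerically. A small sketch (the ±1 gain matrix is the standard one; I am assuming the slide's matrix matches it):

```python
# Gain matrix: U[i][j] is the row player's loss when row plays i, column plays j.
U = [[ 0, -1,  1],   # Rock     vs R, P, S
     [ 1,  0, -1],   # Paper
     [-1,  1,  0]]   # Scissors

def payoff(d, w):
    """d^T U w: expected payoff (row player minimizes, column player maximizes)."""
    return sum(d[i] * U[i][j] * w[j] for i in range(3) for j in range(3))

uniform = [1/3, 1/3, 1/3]
# Against the uniform row strategy every pure column strategy e_j gains exactly 0,
# and symmetrically for the columns, so the value of the game is 0.
column_payoffs = [sum(uniform[i] * U[i][j] for i in range(3)) for j in range(3)]
game_value = payoff(uniform, uniform)
```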
26 Connection to Boosting? Rows are the examples; column u^q encodes weak hypothesis h_q. Row sum: margin of an example. Column sum: edge of a weak hypothesis. Value of the game: min max edge = max min margin (von Neumann's Minimax Theorem).
27 Edges/margins: in the game matrix, the rows give the margins (take the min) and the columns give the edges (take the max); the value of the game is 0. [table entries lost in transcription]
28 New column added (boosting): adding a hypothesis column w_4 increases the value of the game from 0 to 0.11. [table entries lost in transcription]
29 Row added (on-line learning): adding an example row decreases the value of the game from 0 to -0.11. [table entries lost in transcription]
30 Boosting: maximize the margin incrementally. Iteration 1 has weights w^1 and d^1, iteration 2 has w^2 and d^2, and so on. In each iteration, solve an optimization problem to update d; the column player / oracle provides a new hypothesis. Boosting is a column generation method in the d domain and coordinate descent in the w domain.
31 Outline (section divider): What is Boosting?
32 What is Boosting? Boosting = a greedy method for increasing the margin; it converges to the optimum margin w.r.t. all hypotheses. We want a small number of iterations.
33 Assumption on the next weak hypothesis: for the current weighting of the examples, the oracle returns a hypothesis of edge ≥ g. Goal: for a given ε, produce a convex combination of weak hypotheses with soft margin ≥ g - ε in O(log N / ε²) iterations.
34 Recall the min max theorem:
min_{d ∈ S^N} max_{q=1,...,t} u^q · d (edge of hypothesis q) = max_{w ∈ S^t} min_{n=1,...,N} Σ_{q=1}^t u_n^q w_q (margin of example n)
35 Visualizing the margin. [figure: feature 1 vs feature 2]
36 Min max theorem, inseparable case. Slack variables in the w domain = capping in the d domain:
min_{d ∈ S^N, d ≤ (1/ν)1} max_{q=1,...,t} u^q · d = max_{w ∈ S^t, ψ ≥ 0} min_{n=1,...,N} (Σ_{q=1}^t u_n^q w_q + ψ_n) - (1/ν) Σ_{n=1}^N ψ_n
where Σ_{q=1}^t u_n^q w_q + ψ_n is the soft margin of example n.
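The capping constraint d ≤ (1/ν)·1 can be enforced by capping overweight examples and rescaling the rest. The routine below is an illustrative sketch of such a projection (proportional rescaling of the uncapped entries; it is my own sketch, not code from the tutorial):

```python
def cap_distribution(p, c):
    """Map a distribution p onto {d : sum(d) = 1, 0 <= d_n <= c}: repeatedly
    cap entries at c and rescale the remaining mass over the uncapped entries."""
    assert c * len(p) >= 1.0, "cap too small to hold a distribution"
    d = list(p)
    capped = [False] * len(p)
    while True:
        over = [i for i in range(len(d)) if not capped[i] and d[i] > c]
        if not over:
            return d
        for i in over:
            d[i] = c
            capped[i] = True
        free = [i for i in range(len(d)) if not capped[i]]
        rest = 1.0 - c * sum(capped)      # mass left for the uncapped entries
        s = sum(d[i] for i in free)
        for i in free:
            d[i] = d[i] * rest / s

# With nu = 2 on 3 examples (c = 1/2), the dominant example gets capped at 0.5
# and the remaining mass 0.5 is split proportionally over the other two.
d = cap_distribution([0.7, 0.2, 0.1], c=0.5)
```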
37 Visualizing the soft margin: slack ψ lets an example fall inside the margin. [figure: hypothesis 1 vs hypothesis 2]
38 LPBoost. Choose the distribution that minimizes the maximum edge of the current hypotheses by solving:
P_LP^t = min_{Σ_n d_n = 1, d ≤ (1/ν)1} max_{q=1,...,t} u^q · d
All weight is put on examples with minimum soft margin. [figure: objective value P_LP vs d]
39 Outline (section divider): Entropy Regularized LPBoost
40 Entropy Regularized LPBoost
min_{Σ_n d_n = 1, d ≤ (1/ν)1} max_{q=1,...,t} u^q · d + (1/η) Δ(d, d^0)
The resulting weights have the form d_n = exp(-η · soft margin of example n) / Z, a "soft min".
This form of weights appeared first in the ν-Arc algorithm [RSS+00]. Regularization in the d domain makes the problem strongly convex, and the gradient of the dual is Lipschitz continuous in w [e.g. HL93, RW97].
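The "soft min" weighting on this slide can be computed directly: exponentiate the negated soft margins and normalize. A small sketch with made-up margin values (mine), showing that as η grows the distribution concentrates on the minimum-margin example, approaching LPBoost's hard-min behavior:

```python
import math

def entropic_weights(soft_margins, eta):
    """d_n = exp(-eta * soft_margin_n) / Z: a soft-min over the examples."""
    e = [math.exp(-eta * m) for m in soft_margins]
    Z = sum(e)
    return [x / Z for x in e]

soft_margins = [0.9, 0.5, 0.1, -0.2]                # example n = 3 is hardest
d_soft = entropic_weights(soft_margins, eta=2.0)    # smooth: every example kept
d_sharp = entropic_weights(soft_margins, eta=50.0)  # nearly all weight on n = 3
```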
41 The effect of entropy regularization: different distributions on the examples. LPBoost: lots of zeros / brittle. ERLPBoost: smoother. [figures: feature 1 vs feature 2]
42 Outline (section divider): Overview of Boosting algorithms
43 AdaBoost [FS97]
d_n^t := d_n^{t-1} exp(-w_t u_n^t) / Σ_n d_n^{t-1} exp(-w_t u_n^t), where w_t is chosen so that Σ_n d_n^{t-1} exp(-w u_n^t) is minimized, i.e. ∂/∂w Σ_n d_n^{t-1} exp(-w u_n^t) |_{w = w_t} = 0, which is equivalent to u^t · d^t = 0.
Easy to implement. Adjusts the distribution so that the edge of the last hypothesis is zero. Gets within half of the optimal hard margin [RSD07], but only in the limit.
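The claim that this update drives the edge of the last hypothesis to exactly zero is easy to verify numerically. A pure-Python sketch (the distribution and u-vector are made-up values, not from the tutorial):

```python
import math

def adaboost_reweight(d, u):
    """One AdaBoost step for u_n = y_n * h_t(x_n) in {-1, +1}: with
    w_t = 0.5 * ln((1 - eps)/eps), the reweighted distribution satisfies
    u . d = 0 (the last hypothesis's edge becomes zero)."""
    eps = sum(dn for dn, un in zip(d, u) if un < 0)   # weighted error of h_t
    w = 0.5 * math.log((1 - eps) / eps)
    new = [dn * math.exp(-w * un) for dn, un in zip(d, u)]
    Z = sum(new)                                      # normalizer
    return [dn / Z for dn in new]

d = [0.4, 0.3, 0.2, 0.1]   # current distribution on 4 examples
u = [1, -1, 1, -1]         # h_t is right on examples 0 and 2
d_new = adaboost_reweight(d, u)
new_edge = sum(dn * un for dn, un in zip(d_new, u))   # zero up to float error
```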
44 Corrective versus totally corrective: process only the last hypothesis versus all past hypotheses. Corrective: AdaBoost, LogitBoost, AdaBoost*, [SS, COLT08]. Totally corrective: LPBoost, TotalBoost, SoftBoost, ERLPBoost.
45 From AdaBoost to ERLPBoost
AdaBoost (as interpreted in [KW99, La99]):
Primal: min_d Δ(d, d^{t-1}) s.t. u^t · d = 0, d · 1 = 1
Dual: max_w -ln Σ_n d_n^{t-1} exp(-η u_n^t w) s.t. w ≥ 0
Achieves half of the optimum hard margin in the limit.
AdaBoost* [RW05]:
Primal: min_d Δ(d, d^{t-1}) s.t. u^t · d ≤ γ_t, d · 1 = 1
Dual: max_w -ln Σ_n d_n^{t-1} exp(-η u_n^t w) - γ_t ||w||_1 s.t. w ≥ 0
where the edge bound γ_t is adjusted downward by a heuristic. Good iteration bound for reaching the optimum hard margin.
46 SoftBoost [WGR07]:
Primal: min_d Δ(d, d^0) s.t. d · 1 = 1, d ≤ (1/ν)1, u^q · d ≤ γ_t for 1 ≤ q ≤ t
Dual: min_{w ≥ 0, ψ ≥ 0} ln Σ_n d_n^0 exp(-η(Σ_{q=1}^t u_n^q w_q + ψ_n)) + (1/ν)||ψ||_1 + γ_t ||w||_1
where the edge bound γ_t is adjusted downward by a heuristic. Good iteration bound for reaching the soft margin.
ERLPBoost [WGV08]:
Primal: min_{d,γ} γ + (1/η) Δ(d, d^0) s.t. d · 1 = 1, d ≤ (1/ν)1, u^q · d ≤ γ for 1 ≤ q ≤ t
Dual: min_{w,ψ} (1/η) ln Σ_n d_n^0 exp(-η(Σ_{q=1}^t u_n^q w_q + ψ_n)) + (1/ν)||ψ||_1 s.t. w ≥ 0, ||w||_1 = 1, ψ ≥ 0
where for the iteration bound η is fixed to max((2/ε) ln(N/ν), 1/2). Good iteration bound for reaching the soft margin.
47 Corrective ERLPBoost [SS08]
Primal: min_d Σ_{q=1}^t w_q (u^q · d) + (1/η) Δ(d, d^0) s.t. d · 1 = 1, d ≤ (1/ν)1
Dual: min_ψ (1/η) ln Σ_n d_n^0 exp(-η(Σ_{q=1}^t u_n^q w_q + ψ_n)) + (1/ν)||ψ||_1 s.t. ψ ≥ 0
where for the iteration bound η is fixed to max((2/ε) ln(N/ν), 1/2). Good iteration bound for reaching the soft margin.
48 Iteration bounds
Corrective: AdaBoost, LogitBoost, AdaBoost*, [SS, COLT08]. Totally corrective: LPBoost, TotalBoost, SoftBoost, ERLPBoost.
Strong oracle: returns a hypothesis with maximum edge. Weak oracle: returns a hypothesis with edge ≥ g.
In O(ln(N/ν) / ε²) iterations: within ε of the maximum soft margin for the strong oracle, or within ε of g for the weak oracle; ditto for the hard margin case.
In O(log N / g²) iterations: consistency with the weak oracle.
49-53 LPBoost may require Ω(N) iterations: a sequence of game matrices over hypotheses w_1, ..., w_5 and example weights d_1, ..., d_8 with margin and edge rows, illustrating a construction that forces LPBoost to add one hypothesis per example. [numeric entries lost in transcription; initial value of the game is -1]
54 The same construction works with no ties among the edges ("No ties!").
55 LPBoost may return a bad final hypothesis. How good is the master hypothesis returned by LPBoost compared to the best possible convex combination of hypotheses? Any linearly separable dataset can be reduced to a dataset on which LPBoost misclassifies all examples, by adding a bad example or by adding a bad hypothesis.
56 Adding a bad example. [game matrix over w_1, ..., w_5 and d_1, ..., d_9; entries lost in transcription]
57-60 Adding a bad hypothesis. [sequence of game matrices over w_1, ..., w_6 and d_1, ..., d_9; entries lost in transcription]
61 Synopsis
LPBoost is often unstable; for safety, add relative entropy regularization.
Corrective algorithms: sometimes easy to code, fast per iteration.
Totally corrective algorithms: smaller number of iterations, faster overall time when ε is small.
Weak versus strong oracle makes a big difference in practice.
62 O(log N / ε²) iteration bounds
Good: the bound is a major design tool; any reasonable Boosting algorithm should have this bound.
Bad: the bound is weak; for ε = .01 or ε = .001, ln N / ε² is large compared to N. [table entries lost in transcription]
Why are totally corrective algorithms much better in practice?
63 Lower bounds on the number of iterations. A majority of Ω(log N / g²) hypotheses is needed to achieve consistency with a weak oracle of guarantee g [Fr95]. Easy: an Ω(1/ε²) iteration bound for getting within ε of the hard margin with a strong oracle. Harder: an Ω(log N / ε²) iteration bound for the strong oracle [Ne83?].
64 Outline (section divider): Conclusion and Open Problems
65 Conclusion
Adding relative entropy regularization to LPBoost leads to a good boosting algorithm. Boosting is an instantiation of the MaxEnt and MinXEnt principles [Jaynes 57, Kullback 59]. Relative entropy regularization smoothes the one-norm regularization.
Open problems: When the hypotheses have one-sided error, O(log N / ε) iterations suffice [As00, HW03]; does ERLPBoost have an O(log N / ε) bound when the hypotheses are one-sided? Replace geometric optimizers by entropic ones. Compare ours with Freund's algorithms that don't just cap, but forget examples.
66 Acknowledgments. Rob Schapire and Yoav Freund for pioneering Boosting; Gunnar Rätsch for bringing in optimization; Karen Glocer for helping with figures and plots.
More informationConvex Repeated Games and Fenchel Duality
Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,
More informationGame Theory. Greg Plaxton Theory in Programming Practice, Spring 2004 Department of Computer Science University of Texas at Austin
Game Theory Greg Plaxton Theory in Programming Practice, Spring 2004 Department of Computer Science University of Texas at Austin Bimatrix Games We are given two real m n matrices A = (a ij ), B = (b ij
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More information10701/15781 Machine Learning, Spring 2007: Homework 2
070/578 Machine Learning, Spring 2007: Homework 2 Due: Wednesday, February 2, beginning of the class Instructions There are 4 questions on this assignment The second question involves coding Do not attach
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationBoosting. March 30, 2009
Boosting Peter Bühlmann buhlmann@stat.math.ethz.ch Seminar für Statistik ETH Zürich Zürich, CH-8092, Switzerland Bin Yu binyu@stat.berkeley.edu Department of Statistics University of California Berkeley,
More informationBoosting: Foundations and Algorithms. Rob Schapire
Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you
More informationNew Algorithms for Contextual Bandits
New Algorithms for Contextual Bandits Lev Reyzin Georgia Institute of Technology Work done at Yahoo! 1 S A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire Contextual Bandit Algorithms with Supervised
More informationL5 Support Vector Classification
L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander
More informationToday: Linear Programming (con t.)
Today: Linear Programming (con t.) COSC 581, Algorithms April 10, 2014 Many of these slides are adapted from several online sources Reading Assignments Today s class: Chapter 29.4 Reading assignment for
More informationCS229 Supplemental Lecture notes
CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,
More informationThe AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m
) Set W () i The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m m =.5 1 n W (m 1) i y i h(x i ; 2 ˆθ
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 8: Boosting (and Compression Schemes) Boosting the Error If we have an efficient learning algorithm that for any distribution
More informationConvex Repeated Games and Fenchel Duality
Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationSupport Vector Machines
Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization
More informationOnline Learning and Online Convex Optimization
Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game
More informationSupport Vector Machines for Classification and Regression
CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may
More informationSupport Vector Machines. Machine Learning Fall 2017
Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce
More informationName (NetID): (1 Point)
CS446: Machine Learning (D) Spring 2017 March 16 th, 2017 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationJeff Howbert Introduction to Machine Learning Winter
Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationBoosting. Acknowledgment Slides are based on tutorials from Robert Schapire and Gunnar Raetsch
. Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de
More informationAdaBoost. S. Sumitra Department of Mathematics Indian Institute of Space Science and Technology
AdaBoost S. Sumitra Department of Mathematics Indian Institute of Space Science and Technology 1 Introduction In this chapter, we are considering AdaBoost algorithm for the two class classification problem.
More informationMaximizing the Margin with Boosting
Maximizing the Margin with Boosting Gunnar Rätsch and Manfred K. Warmuth RSISE, Australian National University Canberra, ACT 000, Australia Gunnar.Raetsch@anu.edu.au University of California at Santa Cruz
More informationLast updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 12: Weak Learnability and the l 1 margin Converse to Scale-Sensitive Learning Stability Convex-Lipschitz-Bounded Problems
More informationPrecise Statements of Convergence for AdaBoost and arc-gv
Contemporary Mathematics Precise Statements of Convergence for AdaBoost and arc-gv Cynthia Rudin, Robert E Schapire, and Ingrid Daubechies We wish to dedicate this paper to Leo Breiman Abstract We present
More informationSupport Vector Machines: Maximum Margin Classifiers
Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind
More informationLecture 6. Notes on Linear Algebra. Perceptron
Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationMassachusetts Institute of Technology 6.854J/18.415J: Advanced Algorithms Friday, March 18, 2016 Ankur Moitra. Problem Set 6
Massachusetts Institute of Technology 6.854J/18.415J: Advanced Algorithms Friday, March 18, 2016 Ankur Moitra Problem Set 6 Due: Wednesday, April 6, 2016 7 pm Dropbox Outside Stata G5 Collaboration policy:
More informationFrank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c /9/9 page 331 le-tex
Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c15 2013/9/9 page 331 le-tex 331 15 Ensemble Learning The expression ensemble learning refers to a broad class
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationAn Introduction to Boosting and Leveraging
An Introduction to Boosting and Leveraging Ron Meir 1 and Gunnar Rätsch 2 1 Department of Electrical Engineering, Technion, Haifa 32000, Israel rmeir@ee.technion.ac.il, http://www-ee.technion.ac.il/ rmeir
More informationMachine Learning. Lecture 6: Support Vector Machine. Feng Li.
Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)
More information