Hierarchical Boosting and Filter Generation

1 January 29, 2007

2 Plan
- Combining Classifiers
- Boosting
- Neural Network
- Structure of AdaBoost
- Image processing
- Hierarchical Boosting
- Hierarchical Structure
- Filters

5 Combining Classifiers
Some learning algorithms are naturally unstable: they give different results on the same dataset. How can we reduce the variance and improve the accuracy of such an algorithm?
- Perturb and Combine
  - Bagging (Breiman 1996)
  - Boosting (Schapire 1989)

6 Boosting
AdaBoost (Freund and Schapire 1995)
Initialize the distribution $D_1$.
For each iteration $t$ in $1..T$:
1. Find a weak hypothesis $h_t$ using the distribution $D_t$.
2. Calculate the error $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$.
3. Calculate the weight of the hypothesis $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
4. Update the distribution $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$.
Strong hypothesis: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
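The four steps of the AdaBoost loop can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the talk: it assumes binary labels in {-1, +1} and decision stumps as weak hypotheses, and the names `fit_stump`, `stump_predict`, `adaboost` and `predict` are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Decision stump: +1 where pol * (x_j - thr) >= 0, else -1."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, D):
    """Exhaustively pick the stump with the lowest weighted error under D."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = D[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=10):
    n = len(y)
    D = np.full(n, 1.0 / n)                    # initialize D_1 (uniform)
    ensemble = []
    for _ in range(T):
        eps, j, thr, pol = fit_stump(X, y, D)  # steps 1 and 2
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)  # step 3
        h = stump_predict(X, j, thr, pol)
        D = D * np.exp(-alpha * y * h)         # step 4, numerator
        D /= D.sum()                           # step 4, division by Z_t
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Strong hypothesis: sign of the alpha-weighted vote."""
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
print(predict(adaboost(X, y, T=3), X))
```

On this toy set no single stump separates the classes, but three rounds of boosting do, which is the point of combining weak hypotheses.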

11 Boosting (2)
- The distribution represents a set of weights over the training set. The bigger the weight on an example, the harder it is to classify.
- Can be extended to multi-class problems: AdaBoost.MH (Schapire and Singer 1999)
- If each weak hypothesis is slightly better than random, then the training error drops exponentially fast.
- Strong resistance to overfitting.
- Weakness to noise.
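The "exponentially fast" claim can be made precise with the standard training-error bound for binary AdaBoost. The symbols $n$ (number of training examples) and $\gamma_t$ (the edge of round $t$ over random guessing) are introduced here for illustration and do not appear on the slides.

```latex
\[
\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[H(x_i) \neq y_i\right]
\;\le\; \prod_{t=1}^{T} Z_t
\;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
\;\le\; \exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right),
\qquad \gamma_t = \tfrac{1}{2} - \epsilon_t .
\]
```

If every round achieves an edge of at least some fixed $\gamma > 0$, the right-hand side is at most $e^{-2\gamma^2 T}$, so the training error decays exponentially in the number of rounds $T$.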

12 Weak Learners
$h_t : X \to Y$
- Neural Networks
- Decision Trees
- Decision Stumps
multi-class: $h_t(x_i) = y_i = \psi v$

13 Neural Network
Multi-Layer Perceptron
Formula: $F(x) = \tanh\left(\sum_k w_{k,l} \tanh\left(\sum_j w_{j,k} \tanh\left(\sum_i w_{i,j} x_i\right)\right)\right)$
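The nested sums in the perceptron formula are matrix-vector products, so the forward pass can be sketched as below. This is an illustrative sketch only: the function name `mlp_forward` and the convention that each weight matrix has shape (outputs, inputs) are assumptions, and, like the formula, it has no bias terms.

```python
import numpy as np

def mlp_forward(x, W1, W2, W3):
    """Three-layer perceptron: F(x) = tanh(W3 tanh(W2 tanh(W1 x)))."""
    h1 = np.tanh(W1 @ x)      # innermost sum over i
    h2 = np.tanh(W2 @ h1)     # middle sum over j
    return np.tanh(W3 @ h2)   # outer sum over k

rng = np.random.default_rng(0)
x = rng.normal(size=4)
out = mlp_forward(x, rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(2, 3)))
print(out.shape)
```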

14 Structure of AdaBoost
Formula: $H(x) = \operatorname{sign}\left(\sum_t \alpha_t h_t(x)\right)$

20 Comparison
How can we bring the two algorithms closer together?
- Consider a weak hypothesis as some sort of neuron.
- $\alpha$ is an output weight.
- Split (if possible) the weak hypothesis into two parts so that we get a dot product: $h_t(x) = f_t(\omega \cdot x)$
- Have $f_t$ be continuous. Consider the sign function as a very steep sigmoid.
This is just an analogy.
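As one concrete reading of this analogy, a decision stump on a single feature can be written as a neuron: a one-hot weight vector selects the feature, and a steep tanh stands in for the sign function. The function name `stump_as_neuron`, the threshold argument and the steepness `beta` are hypothetical choices made for this sketch.

```python
import numpy as np

def stump_as_neuron(x, feature, threshold, beta=50.0):
    """h(x) = f(omega . x), with f(z) = tanh(beta * (z - threshold))."""
    omega = np.zeros_like(x, dtype=float)
    omega[feature] = 1.0                    # one-hot filter: picks one input
    return np.tanh(beta * (omega @ x - threshold))

x = np.array([0.2, 0.9, 0.5])
print(stump_as_neuron(x, feature=1, threshold=0.5))  # close to +1
print(stump_as_neuron(x, feature=0, threshold=0.5))  # close to -1
```

As `beta` grows, the output approaches the hard ±1 decision of the stump while remaining continuous in `x`.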

21 Comparison
Boosting: weighted votes of many weak hypotheses.
- One layer of black-box neurons
- How deep is this new network?

22 Image Processing
Haar-like features (Viola and Jones, 2002)
- Much stronger than boosting decision stumps.
- Rectangular filters
- Contrast
Can image-specific filters be extended to other types of data?

23 Goals
- Since boosting already creates a layer of hypotheses, can we tweak it so that it creates multiple layers of simple units?
- Will these layers improve boosting as it is?
- Can we build filters that emulate the way Haar-like features work, but not necessarily on images?

24 Hierarchical Boosting with filters

25 Hierarchical Structure
- Number of layers
- Size of layers
- How to grow them

26 Strong Hierarchy
- Strong Hierarchy
  - Grow each layer, one at a time.
  - Fixed size
- Competitive Growth
  - Let each layer develop at the same time.
  - Best unit wins.
  - Greedy.

27 Soft Hierarchy
- There is only one layer to which units are added.

28 Other structural possibilities
- Degenerate layers
- Preprocessing
- Replacement
- Complete Network

32 Measures
Geometrical properties
- Coordinates
  - Default coordinates
  - Extracted from the unit
- Distances
  - Output distance
  - Vote distance
  - Tanimoto coefficient
  - Custom distance
Both are linked.
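Of the distances listed, the Tanimoto coefficient has a standard closed form for real-valued vectors, sketched below. How the talk applies it to units (for example, to their output vectors over the training set) is not stated on the slide, so the function name `tanimoto` and its use on arbitrary vectors are assumptions.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient: (a . b) / (|a|^2 + |b|^2 - a . b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = a @ b
    return dot / (a @ a + b @ b - dot)

print(tanimoto([1, 1, 0], [1, 0, 0]))
```

The coefficient is a similarity (1 for identical non-zero vectors), so a distance can be derived from it as `1 - tanimoto(a, b)`.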

33 Filter shape
Geometrically inspired
Formula:
$$\omega_i = \begin{cases} f(X_i) & \text{if } X_i \text{ is inside the filter} \\ 0 & \text{if } X_i \text{ is outside the filter} \end{cases} \qquad (1)$$
- Contrast
- Real-valued weights
- Number
- Size

34 Single
$$\omega_i\{\text{single}_I\} = \begin{cases} 1 & \text{if } i = I \\ 0 & \text{else} \end{cases} \qquad (2)$$
- Trivial filter
- no geometry*
- normal filter for Decision Stumps
- size: 1
- number: $n$

35 Rectangular
$$\omega_i\{\text{rect}_{I,J}\} = \begin{cases} 1 & \text{if } \forall j \in (1, d):\ x_{i,j} \geq \min(x_{I,j}, x_{J,j}) \text{ and } x_{i,j} \leq \max(x_{I,j}, x_{J,j}) \\ 0 & \text{else} \end{cases} \qquad (3)$$
- Needs coordinates
- Generalization of Haar-like Features
- Contrast possible
- size: 2+
- number: $n^2$ ($dn^2$)
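A rectangular filter of this kind can be sketched as a mask over the input features: each feature has a coordinate vector, and the filter keeps the features that fall inside the axis-aligned box spanned by two chosen features I and J. The function name `rectangular_filter` and the `(n, d)` coordinate array are assumptions for illustration; the sketch covers only the binary filter of Eq. (3), without contrast.

```python
import numpy as np

def rectangular_filter(coords, I, J):
    """Weight vector of length n: 1 inside the box spanned by features I and J, else 0."""
    lo = np.minimum(coords[I], coords[J])   # per-dimension lower corner
    hi = np.maximum(coords[I], coords[J])   # per-dimension upper corner
    inside = np.all((coords >= lo) & (coords <= hi), axis=1)
    return inside.astype(float)

# Coordinates of the 9 pixels of a 3x3 image, in row-major order.
coords = np.array([(r, c) for r in range(3) for c in range(3)])
print(rectangular_filter(coords, I=0, J=4).reshape(3, 3))
```

On an image with pixel coordinates this yields an ordinary rectangle, but the same code works for any data whose features have been assigned coordinates.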

36 Rectangular (2)
Figure: Rectangular filters, without and with contrast, and Haar-like feature

37 Double
$$\omega_i\{\text{double}_{I,J}\} = \begin{cases} 1 & \text{if } i = I \text{ or } i = J \\ 0 & \text{else} \end{cases} \qquad (4)$$
- Extension of the single filter
- no geometry
- can be parametrized (with distance measures)
- size: 2
- number: $n^2$

38 Circular
$$\omega_i\{\text{circ}_{I,r}\} = \begin{cases} 1 & \text{if } d(x_i, x_I) \leq r \\ 0 & \text{else} \end{cases} \qquad (5)$$
- Needs distances
- can be parametrized
- contrast possible
- gaussian possible
- size: 1..n
- number: $n^2$ ($n^3$)
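The circular filter only needs a distance between features. The sketch below assumes Euclidean distance between feature coordinates, which is just one possible choice of d; the function name `circular_filter` and the coordinate array are hypothetical, and contrast and gaussian variants are not covered.

```python
import numpy as np

def circular_filter(coords, I, r):
    """Weight vector of length n: 1 for features within distance r of feature I, else 0."""
    dist = np.linalg.norm(coords - coords[I], axis=1)  # Euclidean d(x_i, x_I)
    return (dist <= r).astype(float)

# Coordinates of the 9 pixels of a 3x3 image, in row-major order.
coords = np.array([(r, c) for r in range(3) for c in range(3)], dtype=float)
print(circular_filter(coords, I=4, r=1.0).reshape(3, 3))
```

With the center pixel as I and radius 1, the filter selects the center and its four direct neighbors.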

39 Circular (2)
Figure: Circular filters, without and with contrast

40 Ellipsoidal
$$\omega_i\{\text{ellipse}_{I,J,r}\} = \begin{cases} 1 & \text{if } d_{\text{ellipse}}(x_i, x_I, x_J) \leq r \\ 0 & \text{else} \end{cases} \qquad (6)$$
- Needs distances
- can be parametrized
- contrast possible
- gaussian possible
- size: 2..n
- number: $n^3$

41 Ellipsoidal (2)
Figure: Ellipsoidal filters, without and with contrast

42 Free (all)
- the other trivial filter
- no geometry*
- variable weights possible (perceptron)
- size: $n$
- number: 1 (inf)

43 All hyper-parameters
- number of layers
- size of layers
- growth method
- measure type
- shape of filters
- parameters of the filters

44 USPS tests
- Standard USPS training/test set
- Baseline: simple decision stumps (upper bound) and Haar-like features (lower bound)
- Simple decision stumps for the first layer, and groups of two for the upper layers.

46 What's next
- Progressive Multi-task learning
- Complete neural network
- Other datasets

47 Thank You!

48 Sources
- Breiman L, Arcing Classifiers, Annals of Statistics, 26(3), 1998
- Freund Y, Schapire R E, Experiments with a New Boosting Algorithm, Machine Learning: Proceedings of the Thirteenth International Conference, 1996
- Schapire R E, Singer Y, Improved Boosting Algorithms Using Confidence-rated Predictions, Machine Learning, 37(3), 1999
- Drucker H, Schapire R, Simard P, Boosting Performance in Neural Networks, International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 1993
- Viola P, Jones M, Robust Real-time Object Detection, International Journal of Computer Vision, 2002
