FloatBoost Learning for Classification


Stan Z. Li, Microsoft Research Asia, Beijing, China
Heung-Yeung Shum, Microsoft Research Asia, Beijing, China
ZhenQiu Zhang, Institute of Automation, CAS, Beijing, China
HongJiang Zhang, Microsoft Research Asia, Beijing, China

(Stan Z. Li: http://research.microsoft.com/~szli. The work presented in this paper was carried out at Microsoft Research Asia.)

Abstract

AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult, especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided on a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier that differentiates between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promise of FloatBoost over AdaBoost.

1 Introduction

Nonlinear classification of high dimensional data is a challenging problem. While designing such a classifier is difficult, AdaBoost learning methods, introduced by Freund and Schapire [3], provide an effective stagewise approach: they learn a sequence of more easily learnable weak classifiers and boost them into a single strong classifier by a linear combination of them. It has been shown that AdaBoost learning minimizes an upper error bound which is an exponential function of the margin on the training set [14].

Boosting learning originated from PAC (probably approximately correct) learning theory [17, 6].

Given that weak classifiers can perform slightly better than random guessing on every distribution over the training set, AdaBoost can provably achieve arbitrarily good bounds on its training and generalization errors [3, 15]. It has been shown that such simple weak classifiers, when boosted, can capture complex decision boundaries [1].

Relationships of AdaBoost [3, 15] to functional optimization and statistical estimation have been established recently. A number of gradient boosting algorithms have been proposed [4, 8, 21]. A significant advance was made by Friedman et al. [5], who show that the AdaBoost algorithms minimize an exponential loss function which is closely related to the Bernoulli likelihood.

In this paper, we address the following problems associated with AdaBoost:

1. AdaBoost minimizes an exponential (or some other form of) function of the margin over the training set. This is convenient for theoretical and numerical analysis. However, the ultimate goal in applications is always minimum error rate, and a strong classifier learned by AdaBoost is not necessarily optimal in this criterion. This problem has been noted, e.g. by [2], but no solution has been found in the literature.

2. An effective and tractable algorithm for learning weak classifiers is needed. Learning the optimal weak classifier, such as the log posterior ratio given in [15, 5], requires estimation of densities in the input data space. When the dimensionality is high, this is a difficult problem by itself.

We propose a method, called FloatBoost (Section 3), to overcome the first problem. FloatBoost incorporates into AdaBoost the idea of Floating Search, originally proposed in [11] for feature selection. A backtrack mechanism therein allows deletion of those weak classifiers that are ineffective or unfavorable in terms of the error rate. This leads to a strong classifier consisting of fewer weak classifiers. Because deletions during backtracking are performed according to the error rate, an improvement in classification error is also obtained.

To solve the second problem, we provide a statistical model (Section 4) for learning weak classifiers and effective feature selection in high dimensional feature space. A base set of weak classifiers, defined as the log posterior ratio, is derived based on an overcomplete set of scalar features. Experimental results are presented in Section 5 on a difficult classification problem, face detection. Comparisons are made between FloatBoost and AdaBoost in terms of the error rate and the complexity of the boosted classifier. The results clearly show that FloatBoost yields a strong classifier consisting of fewer weak classifiers yet achieving lower error rates.

2 AdaBoost Learning

In this section, we give a brief description of the AdaBoost algorithm, in the notion of RealBoost [15, 5], as opposed to the original discrete AdaBoost [3]. For two-class problems, a set of N labelled training examples is given as (x_1, y_1), ..., (x_N, y_N), where y_i \in {+1, -1} is the class label associated with example x_i. A strong classifier is a linear combination of M weak classifiers

    H_M(x) = \sum_{m=1}^{M} h_m(x)                                                  (1)

In this real version of AdaBoost, the weak classifiers can take a real value, h_m(x) \in R, and have absorbed the coefficients needed in the discrete version (there, h_m(x) \in {-1, +1}). The class label for x is obtained as sign[H_M(x)], while the magnitude |H_M(x)| indicates the confidence. Every training example is associated with a weight. During the learning process, the weights are updated dynamically in such a way that more emphasis is placed on hard examples which were erroneously classified previously.
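To make the form of Eq. (1) concrete, the following Python sketch shows a strong classifier as a sum of real-valued weak classifiers, with the sign giving the class label and the magnitude giving the confidence. The class and function names are illustrative, not part of the paper.

```python
from typing import Callable, List

import numpy as np

WeakClassifier = Callable[[np.ndarray], float]   # h_m(x) -> real value


class StrongClassifier:
    """H_M(x) = sum_m h_m(x); coefficients are absorbed into each h_m."""

    def __init__(self, weak_classifiers: List[WeakClassifier]):
        self.weak_classifiers = weak_classifiers

    def score(self, x: np.ndarray) -> float:
        return sum(h(x) for h in self.weak_classifiers)

    def predict(self, x: np.ndarray) -> int:
        return 1 if self.score(x) >= 0 else -1    # class label = sign(H_M(x))

    def confidence(self, x: np.ndarray) -> float:
        return abs(self.score(x))                  # |H_M(x)| is the confidence
```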

Explicit re-weighting of this kind is important for the original AdaBoost. However, recent studies [4, 8, 21] show that this artificial operation is unnecessary and can be incorporated into a functional optimization procedure of boosting.

0. (Input)
   (1) Training examples Z = {(x_1, y_1), ..., (x_N, y_N)}, where N = a + b; of these, a examples have y_i = +1 and b examples have y_i = -1;
   (2) The maximum number M_max of weak classifiers to be combined.
1. (Initialization)
   w_i^{(0)} = 1/(2a) for those examples with y_i = +1, and w_i^{(0)} = 1/(2b) for those examples with y_i = -1. M = 0.
2. (Forward Inclusion)
   while M < M_max:
   (1) M <- M + 1;
   (2) Choose h_M according to Eq. (4);
   (3) Update w_i^{(M)} <- w_i^{(M-1)} exp[-y_i h_M(x_i)], and normalize so that \sum_i w_i^{(M)} = 1.
3. (Output) H(x) = sign[\sum_{m=1}^{M} h_m(x)].

Figure 1: RealBoost algorithm.

An error occurs when H(x) != y, that is, when y H_M(x) < 0. The margin of an example (x, y) achieved by h(x) on the training set is defined as y h(x), and can be considered a measure of the confidence of h's prediction. An upper bound on the classification error achieved by H_M is given by the following exponential loss function [14]:

    J(H_M) = \sum_{i} \exp(-y_i H_M(x_i))                                           (2)

AdaBoost constructs h_m(x) (m = 1, ..., M) by stagewise minimization of Eq. (2). Given the current H_{M-1}(x) = \sum_{m=1}^{M-1} h_m(x), the best h_M(x) for the new strong classifier H_M(x) = H_{M-1}(x) + h_M(x) is the one leading to the minimum cost:

    h_M = arg min_h  J(H_{M-1}(x) + h(x))                                           (3)

It is shown in [15, 5] that the minimizer is

    h_M(x) = (1/2) \log [ P(y = +1 | x, w^{(M-1)}) / P(y = -1 | x, w^{(M-1)}) ]     (4)

where w^{(M-1)} are the weights at stage M - 1. Using P(y | x, w) \propto p(x | y, w) P(y) and letting

    L_M(x) = (1/2) \log [ p(x | y = +1, w^{(M-1)}) / p(x | y = -1, w^{(M-1)}) ]     (5)

    T = (1/2) \log [ P(y = -1) / P(y = +1) ]                                        (6)

we arrive at

    h_M(x) = L_M(x) - T                                                             (7)

The half log-likelihood ratio L_M(x) is learned from the training examples of the two classes, and the threshold T is determined by the log ratio of the prior probabilities. T can be adjusted to balance between detection rate and false alarm (the ROC curve). The algorithm is shown in Fig. 1. (Note: the re-weighting formula in this description is equivalent to the multiplicative rule in the original form of AdaBoost [3, 15].)
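The following sketch of the RealBoost loop in Fig. 1 assumes a helper choose_weak_classifier(X, y, w) that returns a confidence-rated weak classifier in the sense of Eq. (4) (for instance, the histogram-based classifier of Section 4). It is a minimal illustration under these assumptions, not the authors' implementation.

```python
import numpy as np


def realboost(X, y, choose_weak_classifier, max_weak=1000):
    """X: (N, d) array of examples, y: (N,) labels in {+1, -1}."""
    n_pos, n_neg = np.sum(y == +1), np.sum(y == -1)
    # Initialization: weight 1/(2a) on positive examples, 1/(2b) on negatives.
    w = np.where(y == +1, 1.0 / (2 * n_pos), 1.0 / (2 * n_neg))
    ensemble = []
    for _ in range(max_weak):
        h = choose_weak_classifier(X, y, w)        # weak learner, Eq. (4)
        margins = y * np.array([h(x) for x in X])  # y_i * h_M(x_i)
        w = w * np.exp(-margins)                   # re-weighting rule
        w = w / w.sum()                            # normalize to sum 1
        ensemble.append(h)
    return ensemble
```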

In Section 4, we will present a model for approximating p(x | y, w^{(M-1)}).

3 FloatBoost Learning

FloatBoost backtracks after the newest weak classifier h_M is added and deletes unfavorable weak classifiers h_m from the ensemble (1), following the idea of Floating Search [11]. Floating Search [11] was originally aimed at dealing with the non-monotonicity of straight sequential feature selection, non-monotonicity meaning that adding an additional feature may lead to a drop in performance. When a new feature is added, backtracks are performed to delete those features that cause performance drops. The limitations of sequential feature selection are thus amended, the improvement gained at the cost of increased computation due to the extended search.

0. (Input)
   (1) Training examples Z = {(x_1, y_1), ..., (x_N, y_N)}, where N = a + b; of these, a examples have y_i = +1 and b examples have y_i = -1;
   (2) The maximum number M_max of weak classifiers;
   (3) The cost J (the error rate in this work) and the acceptance threshold J*.
1. (Initialization)
   (1) w_i^{(0)} = 1/(2a) for those examples with y_i = +1, and w_i^{(0)} = 1/(2b) for those examples with y_i = -1;
   (2) J_m^{min} = max-value (for m = 1, ..., M_max); M = 0; H_0 = {}.
2. (Forward Inclusion)
   (1) M <- M + 1;
   (2) Choose h_M according to Eq. (4);
   (3) Update w_i^{(M)} <- w_i^{(M-1)} exp[-y_i h_M(x_i)], and normalize so that \sum_i w_i^{(M)} = 1;
   (4) H_M = H_{M-1} \cup {h_M}; if J_M^{min} > J(H_M), then J_M^{min} = J(H_M).
3. (Conditional Exclusion)
   (1) h' = arg min_{h \in H_M} J(H_M - h);
   (2) If J(H_M - h') < J_{M-1}^{min}, then
       (a) H_{M-1} = H_M - h'; J_{M-1}^{min} = J(H_M - h'); M = M - 1;
       (b) re-compute the weights w_i for the current ensemble;
       (c) goto 3.(1);
   (3) else
       (a) if M = M_max or J(H_M) < J*, then goto 4;
       (b) goto 2.(1).
4. (Output) H(x) = sign[\sum_{h_m \in H_M} h_m(x)].

Figure 2: FloatBoost algorithm.

The FloatBoost procedure is shown in Fig. 2. Let H_M = {h_1, ..., h_M} be the so-far-best set of M weak classifiers; let J(H_M) be the error rate achieved by H_M (or a weighted sum of missing rate and false alarm rate, which is usually the criterion in a one-class detection problem); and let J_m^{min} be the minimum error rate achieved so far with an ensemble of m weak classifiers. In Step 2 (forward inclusion), given the weak classifiers already selected, the best weak classifier is added one at a time, exactly as in AdaBoost. In Step 3 (conditional exclusion), FloatBoost removes the least significant weak classifier from H_M, subject to the condition that the removal leads to an error rate lower than J_{M-1}^{min}. This is repeated until no further removal can be made. The procedure terminates when the cost on the training set is below J* or the maximum number M_max of weak classifiers is reached.
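A simplified sketch of the FloatBoost procedure of Fig. 2 is given below. It assumes helpers train_weak(X, y, w) (a weak learner in the sense of Eq. (4)), reweight(X, y, ensemble) (re-derives the example weights for a given ensemble, returning the initial weights for an empty one) and error_rate(X, y, ensemble) (the cost J). For brevity it recomputes the weights from scratch after a deletion instead of caching per-stage quantities as the paper's pseudocode does.

```python
import numpy as np


def floatboost(X, y, train_weak, reweight, error_rate,
               max_weak=100, target_error=0.0):
    ensemble = []
    # j_min[m] = lowest error seen so far for an ensemble of size m.
    j_min = [np.inf] * (max_weak + 1)
    w = reweight(X, y, ensemble)

    while len(ensemble) < max_weak:
        # Forward inclusion: add the best next weak classifier.
        ensemble.append(train_weak(X, y, w))
        m = len(ensemble)
        j_min[m] = min(j_min[m], error_rate(X, y, ensemble))

        # Conditional exclusion: backtrack while removing some weak
        # classifier beats the best ensemble of the smaller size.
        while len(ensemble) > 1:
            errs = [error_rate(X, y, ensemble[:k] + ensemble[k + 1:])
                    for k in range(len(ensemble))]
            k_best = int(np.argmin(errs))
            if errs[k_best] < j_min[len(ensemble) - 1]:
                del ensemble[k_best]
                j_min[len(ensemble)] = errs[k_best]
            else:
                break

        w = reweight(X, y, ensemble)
        if error_rate(X, y, ensemble) <= target_error:
            break
    return ensemble
```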

By incorporating the conditional exclusion, FloatBoost renders both effective feature selection and classifier learning. It usually needs fewer weak classifiers than AdaBoost to achieve a given error rate.

4 Learning Weak Classifiers

This section presents a method for computing the log-likelihood ratio of Eq. (5), required for learning optimal weak classifiers. Since deriving a weak classifier in a high dimensional space is a non-trivial task, we provide a statistical model for stagewise learning of weak classifiers based on scalar features. A scalar feature x' of x is computed by a transform from the n-dimensional data space to the real line, x' = phi(x). A feature can be, for example, a coefficient of a wavelet transform in signal and image processing. If projection pursuit is used as the transform, x' is simply one coordinate of x. A dictionary of candidate scalar features {phi_j} can be created. In the following, we use x^{(m)} to denote the feature selected at the m-th stage, while x'_j = phi_j(x) is the feature computed from x using the j-th transform.

Assuming that {phi_j} is an overcomplete basis, a set of candidate weak classifiers for the optimal weak classifier (7) can be designed in the following way. At stage M, where the features x^{(1)}, ..., x^{(M-1)} have been selected and the weights are given as w^{(M-1)}, we can approximate p(x | y, w) by using the distributions of the M features:

    p(x | y, w) ~ p(x^{(1)}, ..., x^{(M)} | y, w)                                    (8)
                = p(x^{(1)} | y, w) p(x^{(2)} | x^{(1)}, y, w) ...
                  p(x^{(M)} | x^{(1)}, ..., x^{(M-1)}, y, w)                         (9)

Because {phi_j} is an overcomplete basis set, the approximation is good enough for large enough M and when the M features are chosen appropriately. Note that p(x^{(M)} | x^{(1)}, ..., x^{(M-1)}, y, w) is in fact p(x^{(M)} | y, w^{(M-1)}), because w^{(M-1)} contains the information about the entire history of the weights and accounts for the dependencies on x^{(1)}, ..., x^{(M-1)}. Therefore, we have

    p(x | y, w) ~ p(x^{(1)} | y, w^{(0)}) p(x^{(2)} | y, w^{(1)}) ...
                  p(x^{(M)} | y, w^{(M-1)})                                          (10)
                = \prod_{m=1}^{M} p(x^{(m)} | y, w^{(m-1)})                          (11)

On the right-hand side of the above equation, all conditional densities are fixed except the last one, p(x^{(M)} | y, w^{(M-1)}). Learning the best weak classifier at stage M amounts to choosing the best feature x^{(M)} such that the cost J is minimized according to Eq. (3). The conditional probability densities p(x'_j | y, w^{(M-1)}) for the positive class (y = +1) and the negative class (y = -1) can be estimated using histograms computed from the weighted voting of the training examples with the weights w^{(M-1)}. Let

    L_M^{(j)}(x) = (1/2) \log [ p(x'_j | y = +1, w^{(M-1)}) / p(x'_j | y = -1, w^{(M-1)}) ]   (12)

and

    h_M^{(j)}(x) = L_M^{(j)}(x) - T                                                  (13)

This gives the set of candidate weak classifiers {h_M^{(j)}}.
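The sketch below illustrates one histogram-based weak classifier in the spirit of Eqs. (12)-(13): weighted histograms of a scalar feature are accumulated for the two classes, and the weak classifier is the half log-likelihood ratio of the two histograms minus the threshold T. The bin count and smoothing constant are illustrative choices, not values from the paper; such a routine could play the role of the weak learner assumed in the earlier boosting sketches.

```python
import numpy as np


def make_weak_classifier(feature_values, y, w, threshold_T=0.0,
                         n_bins=64, eps=1e-7):
    """feature_values: (N,) scalar feature x'_j for each training example."""
    lo, hi = feature_values.min(), feature_values.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    bins = np.clip(np.digitize(feature_values, edges) - 1, 0, n_bins - 1)

    hist_pos = np.zeros(n_bins)
    hist_neg = np.zeros(n_bins)
    np.add.at(hist_pos, bins[y == +1], w[y == +1])   # weighted voting, y = +1
    np.add.at(hist_neg, bins[y == -1], w[y == -1])   # weighted voting, y = -1
    hist_pos /= hist_pos.sum() + eps
    hist_neg /= hist_neg.sum() + eps

    # Half log-likelihood ratio per bin, as in Eq. (12).
    lut = 0.5 * np.log((hist_pos + eps) / (hist_neg + eps))

    def h(feature_value):
        b = int(np.clip(np.digitize(feature_value, edges) - 1, 0, n_bins - 1))
        return lut[b] - threshold_T                   # Eq. (13)
    return h
```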

Recall that the best h_M for the new strong classifier H_M(x) = H_{M-1}(x) + h_M(x) is the one given by Eq. (3) among all candidates h_M^{(j)}, and the form of the optimal weak classifier has been derived in (7). According to the theory of gradient-based boosting [4, 8, 21], we can choose the optimal weak classifier by finding the h_M^{(j)} that best fits the negative gradient of the cost, -dJ(H(x))/dH(x), evaluated at the current H_{M-1}. In our stagewise approximation formulation, this can be done by first finding the h_M^{(j)} that best fits the gradient in direction and then scaling it so that the two have the same (re-weighted) norm. An alternative selection scheme is simply to choose the h_M^{(j)} for which the error rate (or some other risk), computed from the two histograms p(x'_j | y = +1, w^{(M-1)}) and p(x'_j | y = -1, w^{(M-1)}), is minimized.

5 Experimental Results

Face Detection

The face detection problem here is to classify an image of standard size (e.g. 20x20 pixels) as either face or nonface (imposter). This is essentially a one-class problem in that everything that is not a face is a nonface. It is a very hard problem. Learning-based methods have been the main approach to solving it, e.g. [13, 16, 9, 12]. The experiments here follow the framework of Viola and Jones [19, 18]. There, AdaBoost is used for learning face detection; it performs two important tasks: feature selection from a large collection of features, and construction of classifiers using the selected features.

Data Sets

A set of 5000 face images was collected from various sources. The faces are cropped and re-scaled to the size of 20x20. Another set of 5000 nonface examples of the same size was collected from images containing no faces. The 5000 examples in each set are divided into a training set of 4000 examples and a test set of 1000 examples. See Fig. 3 for a random sample of 10 face and 10 nonface examples.

Figure 3: Face (top) and nonface (bottom) examples.

Scalar Features

Three basic types of scalar features are derived from each example, as shown in Fig. 4, for constructing weak classifiers. These block differences are an extended set of steerable filters, used in [10, 20]. There are hundreds of thousands of different features for admissible parameter values. Each candidate weak classifier is constructed as the log-likelihood ratio (12) computed from the two histograms of a scalar feature for the face (y = +1) and nonface (y = -1) examples (cf. the last part of the previous section).
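Block-difference features of this kind can be computed efficiently with an integral image, so that any rectangle sum costs four array look-ups. The two-rectangle layout below is one illustrative case of the feature types the paper enumerates over a 20x20 window; the function names are hypothetical.

```python
import numpy as np


def integral_image(img):
    """Cumulative sums over rows and columns of a grayscale image."""
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)


def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using the integral image."""
    p = np.pad(ii, ((1, 0), (1, 0)))        # zero row/column for the border
    b, r = top + height, left + width
    return p[b, r] - p[top, r] - p[b, left] + p[top, left]


def two_rect_feature(img, top, left, height, width):
    """Difference between a rectangle and the one directly to its right."""
    ii = integral_image(img)
    return (rect_sum(ii, top, left, height, width)
            - rect_sum(ii, top, left + width, height, width))
```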

Figure 4: The three types of simple Haar-wavelet-like block-difference features defined on a sub-window. The rectangles vary in size and in the distance separating them. Each feature takes a value calculated by a weighted sum of the pixels in the rectangles.

Performance Comparison

The same data sets are used for evaluating FloatBoost and AdaBoost. The performance is measured by the false alarm error rate with the detection rate fixed at 99.5%. While a cascade of stronger classifiers is needed to achieve a very low false alarm [19, 7], here we present the learning curves for the first strong classifier, composed of up to one thousand weak classifiers. This is because what we aim to evaluate here is the contrast between the FloatBoost and AdaBoost learning algorithms, rather than the overall system. The interested reader is referred to [7] for a complete system, which achieves a low false alarm rate at a detection rate of 95%. (A live demo of the multi-view face detection system, the first real-time system of its kind, is being submitted to the conference.)

Figure 5: Error rates of FloatBoost vs. AdaBoost for frontal face detection (training and test curves; horizontal axis: number of weak classifiers; vertical axis: error rate).

The training and test error curves for FloatBoost and AdaBoost are shown in Fig. 5, with the detection rate fixed at 99.5%. The following conclusions can be drawn from these curves: (1) Given the same number of learned features or weak classifiers, FloatBoost always achieves lower training error and lower test error than AdaBoost. For example, on the test set, by combining 1000 weak classifiers, the false alarm of FloatBoost is 0.27 versus 0.85 for AdaBoost. (2) FloatBoost needs many fewer weak classifiers than AdaBoost to achieve the same false alarm. For example, the lowest test error for AdaBoost is 0.81 with 800 weak classifiers, whereas FloatBoost needs only 20 weak classifiers to achieve the same performance. This clearly demonstrates the strength of FloatBoost in learning to achieve a lower error rate.
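As an illustration of how an operating point like the one used in Fig. 5 can be obtained, the sketch below fixes the detection rate on the positive class (99.5% here) by choosing the score threshold from the positive examples and then reads the false alarm rate off the negative examples. Function and variable names are hypothetical.

```python
import numpy as np


def false_alarm_at_detection_rate(scores_pos, scores_neg, detection_rate=0.995):
    """scores_*: real-valued strong-classifier scores H_M(x) per example."""
    # Threshold that keeps `detection_rate` of the positive scores above it.
    thr = np.quantile(scores_pos, 1.0 - detection_rate)
    # Fraction of negatives (wrongly) accepted at that threshold.
    return float(np.mean(scores_neg >= thr))
```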

6 Conclusion and Future Work

By incorporating the idea of Floating Search [11] into AdaBoost [3, 15], FloatBoost effectively improves the learning results. It needs fewer weak classifiers than AdaBoost to achieve a similar error rate, or achieves a lower error rate with the same number of weak classifiers. This performance improvement comes at the cost of longer training time, about 5 times longer for the experiments reported in this paper.

The boosting algorithm may need substantial computation for training. Several methods can be used to make the training more efficient with little drop in training performance. Noticing that only examples with large weight values are influential, Friedman et al. [5] propose to select examples with large weights, i.e. those which have so far been wrongly classified by the learned weak classifiers, for training the weak classifier in the next round. The top examples within a fraction 1 - alpha of the total weight mass are used, where alpha is in [0.01, 0.1].
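A minimal sketch of this weight-trimming heuristic, attributed to Friedman et al. [5], is given below: keep only the examples carrying the top fraction 1 - alpha of the total weight mass when training the next weak classifier. This is an illustration under the stated assumption, not the authors' implementation.

```python
import numpy as np


def trim_by_weight(w, alpha=0.05):
    """Return indices of examples covering a (1 - alpha) fraction of the weight."""
    order = np.argsort(w)[::-1]                       # heaviest examples first
    cum = np.cumsum(w[order]) / w.sum()               # cumulative weight mass
    cutoff = int(np.searchsorted(cum, 1.0 - alpha)) + 1
    return np.sort(order[:cutoff])
```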
References

[1] L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801-849, 1998.
[2] P. Buhlmann and B. Yu. Invited discussion on "Additive logistic regression: a statistical view of boosting (Friedman, Hastie and Tibshirani)". The Annals of Statistics, 28(2):377-386, April 2000.
[3] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.
[4] J. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), October 2001.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337-374, April 2000.
[6] M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.
[7] S. Z. Li, L. Zhu, Z. Q. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In Proceedings of the European Conference on Computer Vision, Copenhagen, Denmark, May 28 - June 2002.
[8] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[9] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In CVPR, pages 130-136, 1997.
[10] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Proceedings of IEEE International Conference on Computer Vision, Bombay, India, 1998.
[11] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15(11):1119-1125, 1994.
[12] D. Roth, M. Yang, and N. Ahuja. A SNoW-based face detector. In Proceedings of Neural Information Processing Systems, 2000.
[13] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[14] R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, October 1998.
[15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80-91, 1998.
[16] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39-51, 1998.
[17] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.
[18] P. Viola and M. Jones. Asymmetric AdaBoost and a detector cascade. In Proceedings of Neural Information Processing Systems, Vancouver, Canada, December 2001.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, December 2001.
[20] P. Viola and M. Jones. Robust real time object detection. In IEEE ICCV Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada, July 2001.
[21] R. Zemel and T. Pitassi. A gradient-based boosting algorithm for regression problems. In Advances in Neural Information Processing Systems, volume 13. MIT Press, Cambridge, MA, 2001.
