DESIGNING RBF CLASSIFIERS FOR WEIGHTED BOOSTING


Vanessa Gómez-Verdejo, Jerónimo Arenas-García, Manuel Ortega-Moral and Aníbal R. Figueiras-Vidal
Department of Signal Theory and Communications, Universidad Carlos III de Madrid
Avda. Universidad 30, 28911 Leganés (Madrid), SPAIN

Abstract. The recent interest in combining Neural Networks has produced a variety of techniques. This paper deals with boosting methods, in particular with Real AdaBoost schemes built up with Radial Basis Function Networks. The Real AdaBoost emphasis function can be divided into two different terms: the first focuses only on the quadratic error of each pattern, and the second only takes into account its proximity to the classification boundary. By incorporating into this fixed emphasis function an additional degree of freedom that allows us to weight these two terms, and by selecting the Radial Basis Function centroids according to the emphasized regions, we show performance improvements: a reduced error rate, faster convergence, and robustness against overfitting.

I. INTRODUCTION

Multi-net systems are a good approach for solving difficult tasks which would otherwise require a very complex net, overcoming sizing and training difficulties. Consequently, during the last years there has been intensive research on the design of Neural Network (NN) ensembles, following different approaches [14]. Among all these procedures, boosting schemes [12], and in particular Real AdaBoost (RA) [2], have demonstrated excellent performance. Boosting designs have their roots in the consideration of weak learners in the light of Probably Approximately Correct (PAC) learning theory [15]. In fact, these designs have become very popular as a way to take advantage of weak learners. Concretely, RA works by sequentially adding a new base learner trained on an emphasized population, paying attention mainly to the most erroneous samples (a detailed description can be found in [13]). During the last years, many researchers have studied the convergence properties of RA, proposing different alternatives to improve its performance and making the list of applications where boosting methods are employed grow rapidly [8].

Some of us have also followed this research line; concretely, we have studied how the RA emphasis works, obtaining some interesting results. For instance, Arenas-García et al. showed in [6] that, although in an indirect manner, these schemes end up concentrating their attention on the samples near the classification boundary. Consequently, in [5] we analyzed how the performance of these schemes can be improved by focusing directly on the samples near the boundary. Furthermore, in that work we also studied the RA emphasis function in detail, showing that it is made up of two different terms which are combined in a fixed way: the first pays attention to the quadratic error of each pattern; the other takes into account its proximity to the boundary. In [4] we showed that an alternative emphasis function with a variable mixing parameter allows both terms to be combined in a more flexible manner and provides some advantages. By studying the performance of the new approach in a Multi-Layer Perceptron (MLP) based RA scheme, we showed that the best way to resample the population is usually to emphasize neither the most erroneous samples nor the boundary ones, but a particular tradeoff between them.

In this paper we go deeper into this idea. First, we show that advantages similar to those reported when using MLPs can be obtained when boosting Radial Basis Function Networks (RBFNs).
Second, to maximize the benefit of the weighted emphasis, we propose a method to appropriately design the Radial Basis Functions (RBFs) of the base classifiers according to the emphasized regions. As we check in several experiments, selecting the RBF centroids according to the kind of emphasis being applied is needed for these ensembles to offer good performance.

The rest of the paper is organized as follows. First, the classical RA algorithm is described, paying special attention to its emphasis function; in Sections III and IV we present the weighted emphasis function and the proposed algorithm to design RBFNs for boosted ensembles, respectively. Section V is devoted to checking the validity of these approaches on some benchmark problems. Finally, in Section VI, conclusions and future research lines are presented.

II. REAL ADABOOST

The fundamental idea of RA is to combine several weak learners in such a way that the ensemble improves their performance. To build up an RA classifier, at each round t = 1, ..., T a new base learner implementing a function o_t(x): X -> [-1, 1] is added, aimed at minimizing the following error function

    E_t = \sum_{i=1}^{L} D_t(i) (d_i - o_t(x_i))^2        (1)

where L is the number of training patterns, d_i \in \{-1, 1\} is the target for pattern x_i, o_t(x_i) is the weak learner output for x_i, and D_t(i) is the weight that the t-th learner emphasis function assigns to x_i.
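To make the role of the emphasis weights in (1) concrete, here is a minimal sketch in Python/NumPy (our own illustration, not code from the paper; the function and variable names are assumptions):

    import numpy as np

    def weighted_quadratic_error(D_t, d, o_t):
        """Weighted quadratic error E_t of eq. (1).

        D_t : shape (L,), emphasis weights of round t (they sum to 1)
        d   : shape (L,), targets in {-1, +1}
        o_t : shape (L,), weak learner outputs in [-1, +1]
        """
        return np.sum(D_t * (d - o_t) ** 2)

    # Example with uniform initial weights D_1(i) = 1/L
    L = 4
    D1 = np.full(L, 1.0 / L)
    d = np.array([+1, -1, +1, -1])
    o1 = np.array([0.8, -0.2, 0.1, -0.9])
    print(weighted_quadratic_error(D1, d, o1))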

Initially, all weights have the same value, D_1(i) = 1/L, i = 1, ..., L, and they are then updated according to

    D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t o_t(x_i) d_i)}{Z_t}        (2)

where Z_t is a normalization factor assuring that \sum_{i=1}^{L} D_{t+1}(i) = 1, and \alpha_t is the weight assigned to the t-th weak learner. The overall output of the net, f_T(x), is calculated as the weighted combination of all learners:

    f_T(x) = \sum_{t=1}^{T} \alpha_t o_t(x)        (3)

The weights \alpha_t are calculated at each round according to

    \alpha_t = \frac{1}{2} \ln \left( \frac{1 + r_t}{1 - r_t} \right)        (4)

where r_t, typically named the edge of the t-th learner, is

    r_t = \sum_{i=1}^{L} D_t(i) o_t(x_i) d_i        (5)

This choice of the \alpha_t values assures that the following bound on the training error is minimized:

    E_{train} = \sum_{i=1}^{L} I[\mathrm{sign}(f_T(x_i)) \neq d_i] \leq \sum_{i=1}^{L} \exp(-f_T(x_i) d_i)        (6)

Furthermore, as was demonstrated in [9], this same criterion maximizes the classification margin, defined as

    \rho = \min_{i=1,...,L} f_T(x_i) d_i        (7)

As explained in [5], (2) can alternatively be expressed as

    D_{t+1}(i) = \frac{1}{Z_t} \exp\left( \frac{(f_t(x_i) - d_i)^2}{2} \right) \exp\left( -\frac{f_t^2(x_i)}{2} \right)        (8)

where each factor stands for a different kind of emphasis:

    Error emphasis:     \exp\left( \frac{(f_t(x_i) - d_i)^2}{2} \right)        (9)

    Boundary emphasis:  \exp\left( -\frac{f_t^2(x_i)}{2} \right)        (10)

In summary, the RA emphasis function (2) can be seen as a fixed combination of two terms: one that pays attention to the quadratic error of each pattern (9), and another that takes into account its proximity to the boundary (10).

III. A WEIGHTED EMPHASIS FUNCTION

In the light of (8), one may wonder whether this fixed combination of emphasis terms is optimal in all situations. In fact, as we have already checked in [4], different combinations of these terms can improve the ensemble performance in general situations. There, we proposed to combine (9) and (10) by means of a weighting parameter \lambda (0 \leq \lambda \leq 1),

    D_{\lambda,t+1}(i) = \frac{1}{Z_t} \exp\left( \lambda (f_t(x_i) - d_i)^2 - (1 - \lambda) f_t^2(x_i) \right)        (11)

This more flexible combination allows paying more or less attention to the proximity to the boundary or to the quadratic error of each sample by selecting different values of \lambda. Three special cases, corresponding to particular values of \lambda, should be remarked:

i) \lambda = 0: only proximity to the boundary is taken into account,

    D_{\lambda=0,t+1}(i) = \frac{1}{Z_t} \exp\left( -f_t^2(x_i) \right)        (12)

ii) \lambda = 0.5: the classical RA emphasis function (2) is recovered.

iii) \lambda = 1: only the quadratic error is considered,

    D_{\lambda=1,t+1}(i) = \frac{1}{Z_t} \exp\left( (f_t(x_i) - d_i)^2 \right)        (13)

In this paper we study the effects of using this weighted emphasis function for the design of a boosted ensemble built up of local classifiers, concretely RBFNs. In [4] we showed that, when building MLP-based multi-net systems, a good selection of the weighting parameter can reduce the error rate, accelerate the convergence and even avoid the overfitting problem. However, when the weak learners are RBFNs, they have to be designed carefully in order to make the most of the weighted emphasis; as we realized in [5], studying the first two particular cases (we have tested a normalized version of (11) for \lambda = 0 and \lambda = 0.5), it is necessary to select the centroids and to adjust their dispersion parameters according to the kind of emphasis. So, in order to solve this problem and thereby improve the performance of RBFN-based boosted ensembles, in the next section we present a new method to design the RBFs according to the emphasized regions.
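As an illustration of the round-by-round mechanics described above, the following sketch (our own minimal Python/NumPy reading of eqs. (4), (5) and (11), not the authors' code; all names are assumptions) computes the learner weight alpha_t from the edge and the weighted emphasis for the next round from the current ensemble output:

    import numpy as np

    def learner_weight(D_t, o_t, d):
        """alpha_t computed from the edge r_t, eqs. (4)-(5)."""
        r_t = np.sum(D_t * o_t * d)              # edge of the t-th learner
        return 0.5 * np.log((1.0 + r_t) / (1.0 - r_t))

    def weighted_emphasis(f_t, d, lam):
        """Weighted emphasis D_{lambda,t+1}, eq. (11).

        f_t : ensemble output after t rounds, shape (L,)
        d   : targets in {-1, +1}
        lam : mixing parameter in [0, 1]; lam = 0.5 recovers classical RA,
              lam = 0 gives pure boundary emphasis, lam = 1 pure error emphasis.
        """
        z = lam * (f_t - d) ** 2 - (1.0 - lam) * f_t ** 2
        w = np.exp(z - z.max())                  # shift the exponent for numerical stability;
        return w / np.sum(w)                     # the shift cancels in the normalization by Z_t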
IV. BUILDING RBFNS FROM AN EMPHASIZED DATASET

An RBF network consists of a group of K basis functions which are linearly combined to produce the network output,

    o_t(x) = \sum_{k=1}^{K} w_k g_k(x)        (14)

where g_k(\cdot) is a function with circular symmetry. We will typically use Gaussian kernels with centers c_k and dispersion parameters \beta_k,

    g_k(x) = \exp\left( -\frac{\| x - c_k \|^2}{\beta_k} \right)        (15)

When we build up a boosted ensemble by combining RBFNs, we train their output weights, w_k, in order to minimize (1), but we also have to design the input layer.
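A compact sketch of this architecture (again our own illustrative Python/NumPy code, written under the assumption that the kernel is the Gaussian of eq. (15) with one dispersion parameter beta_k per centroid; all names are ours):

    import numpy as np

    def rbfn_output(X, centers, betas, w):
        """RBFN output o_t(x) of eqs. (14)-(15) for a batch of patterns.

        X       : (N, dim) input patterns
        centers : (K, dim) centroids c_k
        betas   : (K,) dispersion parameters beta_k
        w       : (K,) output weights w_k
        """
        # Squared Euclidean distances ||x - c_k||^2, shape (N, K)
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        G = np.exp(-sq_dist / betas[None, :])    # Gaussian activations g_k(x)
        return G @ w                             # linear combination of eq. (14)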

That is, we have to select the centroids c_k and the dispersion parameters \beta_k. As we have already pointed out, it is quite adequate to select these parameters according to the emphasized regions. So, to design the radial functions g_k(\cdot) we propose to select both centers and dispersion parameters according to a probability distribution D_{centers}; for the case of binary classification, this method can be summarized as follows (a sketch of the procedure is given after the list):

1) We use the D_{centers} distribution to independently select K_{-1} centroids from the samples corresponding to class C_{-1}, and K_1 centroids from those with positive targets, where K_{-1} and K_1 are chosen according to the a priori probability of each class.

2) Now, for each centroid, we calculate the Euclidean distances, weighted according to D_{centers}, from center c_k to all the patterns attributed to that centroid,

    \mathrm{dist}(x_i, c_k) = N_{c_k} D_{centers,k}(x_i) \| c_k - x_i \|        (16)

    D_{centers,k}(x_i) = \frac{D_{centers}(x_i)}{\sum_{x_j \in C_k} D_{centers}(x_j)}        (17)

where C_k is the set of all patterns belonging to the centroid c_k, and N_{c_k} is the number of patterns in C_k.

3) Finally, we calculate the dispersion parameter of each centroid, \beta_k, as

    \beta_k = \frac{\mu_k}{\sigma_k}        (18)

where \mu_k and \sigma_k are, respectively, the mean and standard deviation of \mathrm{dist}(x_i, c_k), calculated as

    \mu_k = \frac{1}{N_{c_k}} \sum_{x_i \in C_k} \mathrm{dist}(x_i, c_k)        (19)

    \sigma_k = \sqrt{ \frac{1}{N_{c_k}} \sum_{x_i \in C_k} \left( \mathrm{dist}(x_i, c_k) - \mu_k \right)^2 }        (20)
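The sketch below shows one way this design procedure could be implemented (our own Python/NumPy reading of steps 1-3 and eqs. (16)-(20), not the authors' code; in particular, sampling the centroids without replacement and attributing each pattern to its nearest centroid are assumptions, as are all names):

    import numpy as np

    def design_rbf_layer(X, d, D_centers, K, rng=None):
        """Select centroids and dispersion parameters from an emphasized dataset.

        X         : (L, dim) training patterns
        d         : (L,) targets in {-1, +1}
        D_centers : (L,) emphasis distribution used for the centroid selection
        K         : total number of centroids
        """
        rng = np.random.default_rng(rng)
        L = len(d)

        # Step 1: split K between the classes according to their a priori
        # probabilities and sample centroid positions from the emphasized
        # distribution restricted to each class.
        centers = []
        for label in (-1, +1):
            idx = np.flatnonzero(d == label)
            K_c = max(1, round(K * len(idx) / L))
            p = D_centers[idx] / D_centers[idx].sum()
            chosen = rng.choice(idx, size=K_c, replace=False, p=p)
            centers.append(X[chosen])
        centers = np.vstack(centers)

        # Attribute every pattern to its closest centroid (the sets C_k).
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = sq_dist.argmin(axis=1)

        # Steps 2-3: weighted distances (16)-(17), then beta_k = mu_k / sigma_k (18)-(20).
        betas = np.ones(len(centers))
        for k in range(len(centers)):
            members = np.flatnonzero(assign == k)
            if len(members) < 2:
                continue                         # keep a default dispersion for tiny clusters
            Dk = D_centers[members] / D_centers[members].sum()
            dist = len(members) * Dk * np.linalg.norm(X[members] - centers[k], axis=1)
            mu, sigma = dist.mean(), dist.std()
            betas[k] = mu / sigma if sigma > 0 else 1.0
        return centers, betas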
V. EXPERIMENTS

To show the performance of combining a weighted emphasis with an adequate RBF design, we have built a series of ensembles and evaluated them on several binary problems. In particular, we have selected four binary problems from [1]: Abalone (a multiclass problem converted to binary according to [11]), Breast, Image and Ionosphere; and two synthetic problems: Kwok [7] and Ripley [10]. Table I summarizes their main features: number of dimensions (dim) and number of samples of each class {C_1/C_{-1}} in the training and test sets. Some of the problems have a predefined test set; when this was not the case, ten random partitions with 40% of the data set have been selected to test the performance of the classifier.

TABLE I. MAIN FEATURES OF THE BENCHMARK PROBLEMS.

    Problem   dim   # Train samples   # Test samples
    Abalone     8   138/-             -/87
    Breast      9   75/-              -/96
    Image      18   81/-              -/93
    Ion        34   75/-              -/96
    Kwok        -   300/00            610/4080
    Ripley      -   15/15             500/-

Each of these ensembles is a linear combination of RBFNs with the input layer designed according to the method proposed in Section IV, and output weights trained to minimize cost function (1) by means of a stochastic gradient descent algorithm with a learning step linearly decreasing from 0.1 to 0 along the training epochs (this number of epochs ensures the convergence of the networks). The selection of the ensemble output weights \alpha_t is done according to (4); in this way, we are still minimizing a bound on the training error, and simultaneously maximizing the classifier margin, as RA does. To highlight the importance of an optimal tradeoff between error and boundary emphasis, we have built different ensembles emphasizing the RBF design and the output layer training according to (11), where \lambda has been explored from 0 to 1 with a step of 0.1. Table II displays the mean value and, between brackets, the standard deviation of the test errors averaged over 50 independent runs, for classical RA (\lambda = 0.5) and for the weighting parameter that achieves the best result, denoted as \lambda_0.

For each binary problem, we have used a different number of rounds T, in order to assure a complete convergence of the ensemble, and RBFNs with different representational power; concretely, 2% and 10% of the number of training data have been selected as centroids (denoted as %K in the tables). Furthermore, to measure the statistical significance of the differences between approaches, we have used the Wilcoxon Rank Test (WRT) [3], where a value lower than 0.1 indicates that the differences are significant (values lower than 0.01 have been rounded down to 0); on the contrary, this value is close to 1 when there is no statistical difference between the two error rates.

It can be seen that the fixed combination of the emphasis terms that RA employs is not always the best tradeoff. In some problems an emphasis focused mainly on the boundary patterns (\lambda < 0.3) achieves a significant error rate reduction, for instance in Abalone, Kwok, or in Image when the percentage of centers of the weak learners is 2%. In other problems, like Ripley, it is better to concentrate the emphasis on the most erroneous patterns. An optimal tradeoff between the emphasis terms not only reduces the error rate, but can also avoid the overfitting problem that RA presents and accelerate the convergence of the algorithm. For instance, in Ionosphere (see Figure 1), we get a faster error rate reduction for large \lambda values, while lower \lambda values offer lower error rates.
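Putting the pieces together, one boosting round of the kind described above could look like the following sketch (our own illustration, reusing the weighted_emphasis, learner_weight and design_rbf_layer helpers sketched earlier; the closed-form weighted least-squares fit and the clipping of the weak output to [-1, 1] are simplifications we introduce in place of the stochastic gradient descent training used in the paper):

    import numpy as np

    def boosting_round(X, d, D_t, f_t, lam, n_centroids, rng=None):
        """One round of the RBFN-based ensemble with weighted emphasis.

        Returns the new learner (centers, betas, w), its weight alpha_t,
        the updated ensemble output f_{t+1} and the emphasis for the next round.
        """
        # Design the RBF input layer from the emphasized population (Section IV).
        centers, betas = design_rbf_layer(X, d, D_t, n_centroids, rng=rng)

        # Fit the output weights to minimize the weighted cost (1).
        # (Weighted least squares here; the paper uses stochastic gradient descent.)
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        G = np.exp(-sq_dist / betas[None, :])
        sw = np.sqrt(D_t)
        w = np.linalg.lstsq(G * sw[:, None], d * sw, rcond=None)[0]

        o_t = np.clip(G @ w, -1.0, 1.0)          # keep the weak output in [-1, 1]
        alpha_t = learner_weight(D_t, o_t, d)    # eqs. (4)-(5)

        f_next = f_t + alpha_t * o_t             # ensemble output, eq. (3)
        D_next = weighted_emphasis(f_next, d, lam)   # emphasis for round t+1, eq. (11)
        return (centers, betas, w), alpha_t, f_next, D_next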

TABLE II. TEST ERRORS DESIGNING THE RBF CENTROIDS WITH EMPHASIS (columns: Prob., %K, T, E_RA, \lambda_0, E_{\lambda_0}, WRT).

TABLE III. TEST ERRORS DESIGNING THE RBF CENTROIDS WITHOUT EMPHASIS (columns: Prob., %K, T, E_RA, \lambda_0, E_{\lambda_0}, WRT).

Fig. 1. Test error in Ionosphere (2% of centers): evolution of the test error for \lambda = 0.2, 0.5 and 1.

TABLE IV. TEST ERRORS SEPARATING THE RBFN LAYER EMPHASES (columns: Prob., %K, T, E_RA, \lambda_{out}, \lambda_{cen}, E_{\lambda_{out},\lambda_{cen}}; rows: Abalone and Kwok).

Fig. 2. Test error in Kwok (10% of centers) with regard to \lambda_{cen} and \lambda_{out}.

In order to show the improvement achieved by boosting the center selection, we present in Table III the error rates obtained when we only emphasize the population used to train the output layer of the RBFNs; i.e., we have designed the RBFs according to the method presented in Section IV, but now the probability distribution D_{centers} is always a uniform distribution, independent of the output layer emphasis function, which is still applied according to (11). The proposed method offers an error rate improvement for every problem but Abalone, reducing both its mean value and its standard deviation. Furthermore, in problems like Ripley, Image or Ionosphere, a faster convergence can be observed.

Finally, trying to improve the performance of the boosted ensemble even further, we have separated the weighting parameter of the output layer from that used for the centroid design; i.e., we apply the emphasis function (11) in both training procedures, but with different weighting parameters: \lambda_{out} for training the RBFN output layer and \lambda_{cen} for emphasizing the centroid selection. This approach has been tested on two particular problems, Abalone and Kwok, when the RBFNs have 10% of the training data as centroids. In particular, several ensembles have been built varying the weighting parameters \lambda_{out} and \lambda_{cen} in the range [0, 1] with a step of 0.1; the best error rates of this approach are shown in Table IV. Finally, we can see in Figure 2, where the test error in the Kwok problem is depicted as a function of \lambda_{out} and \lambda_{cen}, how by tuning these parameters we can modify the ensemble performance considerably. For example, when the centroid design is boosted focusing on the most erroneous patterns, \lambda_{cen} > 0.9, the error rate is about 12%, and the value of \lambda_{out} introduces only slight changes; whereas, when the centroids are placed on the boundary, the value of \lambda_{out} is critical, and it can change the error rate from 12% to 13.45%. Of course, a good tradeoff between these parameters is necessary to get the best results: selecting the centroids according to the most erroneous patterns (\lambda_{cen} = 0.9) and focusing the training of the RBF output layer on the patterns nearest to the boundary (\lambda_{out} = 0.2), we get an error rate of 11.9%.
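A decoupled round along the lines of this last experiment could be sketched as follows (our own illustration reusing the helpers above; \lambda_{cen} = 0.9 and \lambda_{out} = 0.2 correspond to the tradeoff reported as best for Kwok, and everything else is an assumption):

    def decoupled_round_emphases(f_t, d, lam_cen=0.9, lam_out=0.2):
        """Two emphasis distributions for one round: D_cen drives the centroid and
        dispersion design of Section IV, while D_out weights the output-layer cost (1)."""
        D_cen = weighted_emphasis(f_t, d, lam_cen)
        D_out = weighted_emphasis(f_t, d, lam_out)
        return D_cen, D_out

The centroids would then be obtained from design_rbf_layer(X, d, D_cen, K), while the output weights are fitted under D_out exactly as in boosting_round.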

VI. CONCLUSIONS AND FUTURE WORK

In this paper we have checked that using different tradeoffs between error and boundary emphasis in boosting methods provides performance improvements. We have especially pointed out the influence of this new boosting strategy on the design of multi-net systems when local classifiers, such as RBFNs, are used as base learners. In this case, to get the most from emphasis methods it is necessary to apply modified criteria for selecting the centroids and dispersion parameters of the RBFs. Experiments on several benchmark problems show that a good selection of the weighting parameter, together with a good design of the basis functions, can reduce the error rate, accelerate the convergence of the ensemble, and avoid the overfitting problem of RA schemes. Finally, we have shown how these results can be enhanced even further if the emphasis applied to the training of the RBF output layer is separated from the emphasis used for designing the basis functions.

This evidence suggests the appropriateness of designing automatic methods to select the optimal values of the weighting parameters. Moreover, it would be very interesting to adapt the weighting values as the ensemble grows, in order to select in the first rounds \lambda values that give a faster convergence, and in the last rounds \lambda values that allow a lower error rate. These ideas constitute a promising research line on which we are currently working.

ACKNOWLEDGMENT

This work has been partly supported by CICYT grant TIC. The work of V. Gómez-Verdejo was also supported by the Education Department of the Madrid Community through a scholarship.

REFERENCES

[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. mlearn/mlrepository.html, University of California, Irvine, Dept. of Information and Computer Sciences.
[2] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th Intl. Conf. on Machine Learning, Bari, Italy, 1996.
[3] J. D. Gibbons. Nonparametric Statistical Inference. Marcel Dekker, New York, 4th edition, 2003.
[4] V. Gómez-Verdejo, M. Ortega-Moral, J. Arenas-García, and A. Figueiras-Vidal. Boosting by weighting boundary and erroneous samples. To be presented at ESANN 2005, Bruges, Belgium, April 2005.
[5] V. Gómez-Verdejo, M. Ortega-Moral, J. P. Cabrera, J. Arenas-García, and A. Figueiras-Vidal. Boosting by emphasizing boundary samples. In Proc. of the Learning'04 Intl. Conf., pages 67-72, Elche, Spain, 2004.
[6] J. Arenas-García, A. R. Figueiras-Vidal, and A. J. C. Sharkey. The beneficial effects of using multi-net systems that focus on hard patterns. In T. Windeatt and F. Roli, editors, Multiple Classifier Systems (Proc. 4th Intl. Workshop), pages 45-54, Surrey, U.K., 2003. Springer-Verlag (LNCS).
[7] J. T. Kwok. Moderating the output of support vector classifiers. IEEE Trans. on Neural Networks, 10(5), 1999.
[8] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. Smola, editors, Advanced Lectures on Machine Learning, LNCS. Springer-Verlag, 2003.
[9] G. Rätsch, T. Onoda, and K. R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, March 2001.
[10] B. D. Ripley. Neural networks and related methods for classification (with discussion). J. Roy. Statist. Soc., Series B, 56, 1994.
[11] A. Ruiz and P. E. López de Teruel. Nonlinear kernels-based statistical pattern analysis. IEEE Trans. on Neural Networks, 12(1):16-32, 2001.
[12] R. E. Schapire. The strength of weak learnability. In 30th Annual Symposium on Foundations of Computer Science, pages 28-33, 1989.
[13] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
[14] A. J. C. Sharkey. Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems. Springer-Verlag, London, UK, 1999.
[15] L. Valiant. A theory of the learnable. Communications of the ACM, 27, 1984.
