Performance evaluation of Bagging and Boosting in nonparametric regression


SIMONE BORRA (*) - AGOSTINO DI CIACCIO (**)

Performance evaluation of Bagging and Boosting in nonparametric regression

Contents: 1. Introduction. 2. Bagging and Boosting in Projection Pursuit Regression. 3. Simulation study on real data sets. 4. Conclusions. References. Summary. Riassunto. Key words.

1. Introduction

In recent years new algorithms have been proposed in the literature to improve the accuracy of nonparametric classifiers or the prediction capability of nonparametric regression methods. In this context, very promising empirical results have been obtained with procedures that aggregate simple predictors built by resampling the training sample. Two relevant procedures are Bagging (Breiman, 1996) and Boosting (Freund & Schapire, 1997). Simulation studies on real and artificial data sets showed that, using Bagging or Boosting with Decision Trees (for example, CART) or with Neural Networks, it is possible to obtain an appreciable reduction of the prediction error or the classification error (Drucker & Cortes, 1996; Freund & Schapire, 1996; Quinlan, 1996; Schwenk & Bengio, 1998). These techniques can now be considered a valid alternative to the pruning method commonly used with Decision Trees. With regard to the performance comparison of Bagging and Boosting, some simulation studies, concerning in particular the use of CART, showed that Boosting works better than Bagging (Breiman, 1998a; Bauer & Kohavi, 1998; Dietterich, 1998).

(*) University of Rome Tor Vergata - Dept. Sefemeq
(**) University of Urbino - Faculty of Economics

In this paper we consider the prediction capability of nonparametric regression methods and, in particular, we compare the use of Boosting and Bagging in conjunction with Projection Pursuit Regression, as proposed in Borra & Di Ciaccio (2001). In the present simulation study we use well-known real data sets, often considered in other papers, to allow a comparison with other approaches presented in the literature. We remark that the choice to consider both real and artificial data sets depends on the different information that simulation studies can provide: with artificial data, fairly regular, we can study the influence of the parameters on the prediction capability of a method; with real data, generally more complex and irregular, we can observe the effective prediction capability in empirical studies. The results obtained in this work show that both techniques are efficient also when they are applied with models more complex than the weak learners most often considered in the literature (for example, Decision Trees).

2. Bagging and Boosting in Projection Pursuit Regression

Consider a nonparametric regression problem Y = f(X) + ε, where we suppose that a response variable Y is related to p predictor variables X = {X_1, X_2, ..., X_p} by an unknown nonlinear function f(X), defined in such a way that ε is a noise component with E(ε) = 0. We do not assume f(X) to be the true link function between Y and the predictor variables; rather, we consider f(X) a function useful to reconstruct the values of Y starting from the values of the predictor variables. The objective is not to explain the link existing between the variables, but to obtain a tool useful in a prediction problem. In this context we evaluate the choice of the estimator of the function f(X) in terms of its prediction capability. Many strategies have been explored to improve the prediction capability in the presence of nonlinear associations, for example cross-validation and methods based on data perturbation (Efron & Tibshirani, 1993; Dietterich, 1997). In this paper, in accordance with the methodology used by some authors (for example, Breiman 1998a, 1998b), we compare the prediction capability of some estimators of f(X) on empirically known populations, drawing from them independent random samples.

If we indicate with g(X | t) an approximating function of the unknown function, based on the n observations of a training sample t = {y_i, x_i}_{i=1}^n, then we use as a measure of prediction error the mean squared generalization error:

    PE(g \mid t) = E_{Y,X}\,(Y - g(X \mid t))^2    (1)

where E_{Y,X} indicates the average over all values (y, x) of the population not included in the training sample. The prediction capability of the function g(X | t) can be evaluated by considering the average of the sampling distribution of PE(g | t), indicated by PE(g).

Some recent approaches show that it is possible to build an aggregated predictor able to improve (in terms of prediction capability) the performance obtained by a single predictor. In general, single approximating functions g_1(X | t_1), ..., g_K(X | t_K) are calculated on the basis of a sequence of sub-samples t_1, ..., t_K generated from the training sample t (by resampling, possibly with a system of weights); they are then suitably aggregated to build an aggregated predictor. This approach has produced two relevant procedures: Bagging, including its variants (Breiman 1996, 1998b, 1999), and Boosting (Drucker, 1997; Ridgeway et al., 1999).

We can describe the Bagging procedure by the following steps:
1. From the training sample t = {y_i, x_i}_{i=1}^n, draw with replacement one sub-sample t_k.
2. Calculate the predictor g_k(X | t_k).
3. Repeat Steps 1 and 2 K times and finally calculate the aggregated bagging predictor as:

    g_{bagg}(X \mid t) = \sum_{k=1}^{K} \gamma_k \, g_k(X \mid t_k)    (2)

where γ_k ≥ 0 and \sum_{k=1}^{K} γ_k = 1 (usually γ_k = 1/K); a minimal code sketch of these steps is given below.

The explanation given by Breiman (1996) for the reduction of PE(g) by means of the aggregated bagging predictor considers the following decomposition:

    PE(g) = var(\varepsilon) + E_X\,(f(X) - \bar{g}(X))^2 + E_{X,T}\,(g(X \mid T) - \bar{g}(X))^2    (3)

where var(ε) is the variance of the noise component and ḡ(X) = E_T(g(X | T)) is the average of g(X | T) over all training samples of size n. The second and third terms of (3) are, respectively, the squared bias and the variance of the approximating function.
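As an illustration of the Bagging steps above, the following is a minimal Python sketch, with scikit-learn's DecisionTreeRegressor used as a stand-in for the base learner (the paper estimates PPR with the SMART program, which is not assumed to be available here); the names fit_bagging and bagging_predict are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for the PPR base learner

def fit_bagging(X, y, K=30, random_state=0):
    """Steps 1-3: fit K base regressors on bootstrap sub-samples of the training sample.

    X, y are NumPy arrays of shape (n, p) and (n,).
    """
    rng = np.random.RandomState(random_state)
    n = len(y)
    models = []
    for _ in range(K):
        idx = rng.randint(0, n, size=n)                             # Step 1: draw with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # Step 2: fit g_k on sub-sample t_k
    return models

def bagging_predict(models, X):
    """Step 3 / equation (2): aggregate with equal weights gamma_k = 1/K."""
    preds = np.column_stack([g_k.predict(X) for g_k in models])
    return preds.mean(axis=1)
```

With γ_k = 1/K the aggregation is a simple average of the K bootstrap predictors, which is the usual choice mentioned in the text.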

The effect produced by Bagging (demonstrated in many simulation studies) is to reduce the variance term considerably while keeping the bias essentially constant. Recently, Friedman & Hall (1999) pointed out a theoretical justification, showing the decomposition of a statistical estimator of a function f(X) into a linear part and higher-order parts. The authors deduced that the effect of Bagging is essentially due to the reduced variability of the nonlinear component, which is replaced by an estimate of its expected value, consequently decreasing the variability of the estimator.

The Boosting algorithm was initially studied for classification problems, and only recently have similar approaches been proposed for regression problems. Differently from Bagging, where all cases of the training sample have the same probability of inclusion in the sub-sample, in the Boosting algorithm the probabilities of inclusion change. In particular, considering the k-th sub-sample, the probability of inclusion for the i-th case increases if the corresponding value y_i was poorly predicted by the approximating functions estimated on the k - 1 previous sub-samples. In this way the algorithm forces the sub-sample to include cases with values that, up to that moment, were poorly predicted. We considered a Boosting algorithm for regression problems proposed by Drucker (1997) (a modification of AdaBoost, proposed by Freund and Schapire for classification problems), which can be summarized as follows:
1. First, assign the same weight w_i = 1 (i = 1, ..., n) to all cases of the training sample.
2. Generate a sub-sample t_k by drawing with replacement n examples from the training sample t, where the probability of inclusion for the i-th case is fixed to p_i = w_i / \sum_{j=1}^{n} w_j.
3. Fit on the sub-sample t_k the approximating function g_k(X | t_k) and, for all cases of the training sample, obtain the predicted value of Y, ŷ_i^k (i = 1, ..., n).
4. Calculate over all cases a normalised quadratic loss function, L_i(g_k) = (ŷ_i^k - y_i)^2 / \sum_j (ŷ_j^k - y_j)^2 for i = 1, ..., n, and finally calculate the average loss L̄(g_k) = \sum_{i=1}^{n} L_i(g_k) p_i.
5. Calculate the coefficient β(g_k) = L̄(g_k) / (1 - L̄(g_k)), which indicates the prediction capability of the approximating function g_k(X | t_k): a small value of β(g_k) indicates a high prediction capability.
6. Update the weights as w_i ← w_i β(g_k)^(1 - L_i(g_k)): in this way, a small loss for the i-th case corresponds to a larger reduction of the weight w_i and consequently to a lower probability p_i of being included in the successive sub-sample, so that poorly predicted cases retain relatively larger weights.
7. Repeat the procedure from Step 2 to Step 6 K times, or stop earlier when the average loss L̄(g_k) is no longer less than 0.5.
8. To obtain the aggregated boosting predictor, use the weighted median:

    g_{boost}(X \mid t) = \inf\Big\{ y \in Y : \sum_{k:\, g_k(X \mid t_k) \le y} \log(1/\beta(g_k)) \ge \tfrac{1}{2} \sum_{k} \log(1/\beta(g_k)) \Big\}    (4)

The aggregated predictor g_boost weighs the K approximating functions, giving higher weights to those with the best fit over all cases of the training sample. In this way, the specialized prediction capability of the K approximating functions on a restricted number of harder cases is balanced against their global prediction capability over all cases of the training sample. In Step 4, Drucker (1997) also proposed two other candidate loss functions, linear and exponential, but in this study we always chose the quadratic loss function. A minimal sketch of this scheme is given below.
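The following Python sketch mirrors Steps 1-8 with a generic base regressor in place of PPR. It is one possible reading of the scheme described above (in particular of the stopping rule and of the normalised quadratic loss), not the authors' implementation, and the names fit_boosting and boosting_predict are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for the PPR base learner

def fit_boosting(X, y, K=50, random_state=0):
    """One reading of Steps 1-7 with the normalised quadratic loss; X, y are NumPy arrays."""
    rng = np.random.RandomState(random_state)
    n = len(y)
    w = np.ones(n)                                       # Step 1: equal initial weights
    models, betas = [], []
    for _ in range(K):
        p = w / w.sum()                                  # Step 2: inclusion probabilities
        idx = rng.choice(n, size=n, replace=True, p=p)
        g_k = DecisionTreeRegressor().fit(X[idx], y[idx])  # Step 3: fit on sub-sample t_k
        y_hat = g_k.predict(X)                           # predictions on the whole training sample
        sq_err = (y_hat - y) ** 2
        L = sq_err / sq_err.sum()                        # Step 4: normalised quadratic loss
        L_bar = np.sum(L * p)                            # average loss
        if L_bar >= 0.5:                                 # Step 7: stop when the loss is no longer < 0.5
            break
        beta = L_bar / (1.0 - L_bar)                     # Step 5
        w = w * beta ** (1.0 - L)                        # Step 6: well-predicted cases lose weight
        models.append(g_k)
        betas.append(beta)
    return models, betas

def boosting_predict(models, betas, X):
    """Step 8 / equation (4): weighted median of the aggregated predictions."""
    preds = np.column_stack([g.predict(X) for g in models])   # shape (n_obs, number of models)
    log_w = np.log(1.0 / np.array(betas))
    half_total = 0.5 * log_w.sum()
    out = np.empty(preds.shape[0])
    for i, row in enumerate(preds):
        order = np.argsort(row)
        cum = np.cumsum(log_w[order])
        j = np.searchsorted(cum, half_total)              # smallest prediction reaching half the weight
        out[i] = row[order[min(j, len(row) - 1)]]
    return out
```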

In the literature, the properties of Bagging and Boosting applied to regression problems have been studied mainly with Regression Trees and K-Nearest Neighbors as approximating functions (Breiman, 1999; Drucker, 1997; Ridgeway et al., 1999). In this work we use as approximating function Projection Pursuit Regression (PPR) (Friedman & Stuetzle, 1981). We define PPR as:

    g_k(X \mid t_k) = \sum_{m=1}^{M} \beta_{m,k} \, h_{m,k}(\alpha_{m,k}^T X \mid t_k)    (5)

where the h_{m,k}(·) are smooth functions of different linear combinations of the predictor variables. Increasing the number M of smooth functions, the model is capable of fitting functions of higher complexity. Diaconis and Shahshahani (1984) demonstrated that, for a large value of M and under very general conditions, every continuous function of X can be approximated by PPR. To estimate the PPR model we used the SMART program (Friedman, 1985); to employ a pruning strategy it is necessary to select the largest number of smooth functions, M_L, to use in the search, as well as the final number of smooth functions, M_F.
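As a purely structural illustration of equation (5), the sketch below evaluates a projection pursuit model with M fixed ridge terms; the directions, coefficients and smooth functions are arbitrary choices made only to show the form of the predictor, and this is not the SMART estimation procedure.

```python
import numpy as np

def ppr_predict(X, alphas, betas, smoothers):
    """Evaluate g(X) = sum_m beta_m * h_m(alpha_m' X), the form of equation (5)."""
    g = np.zeros(X.shape[0])
    for alpha, beta, h in zip(alphas, betas, smoothers):
        z = X @ alpha          # projection of the predictors on the direction alpha_m
        g += beta * h(z)       # ridge term beta_m * h_m(alpha_m' X)
    return g

# Toy usage with M = 2 ridge terms and p = 3 predictors (directions and smoothers are arbitrary).
X = np.random.randn(5, 3)
alphas = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.7, 0.7])]
betas = [2.0, -1.0]
smoothers = [np.tanh, lambda z: z ** 2]
print(ppr_predict(X, alphas, betas, smoothers))
```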

In our applications we fixed the number of terms as M_L = M_F + 4. We performed simulation studies regarding the simple predictor PPR, the aggregated Bagging predictor and the aggregated Boosting predictor, in order to estimate the values of PE(g_PPR), PE(g_PPRbagg), PE(g_PPRboost) and the related variances.

3. Simulation study on real data sets

Using artificial data sets originated in Friedman (1991), Borra & Di Ciaccio (2001) showed that Bagging applied to PPR can reduce the prediction error considerably, and they pointed out some factors affecting the performance of Bagging. In particular they showed that:
1. Increasing K, the number of aggregated PPR models calculated on the bootstrapped samples, the average prediction error PE(g_PPRbagg) decreases until it becomes constant.
2. A rise of complexity in the approximating function resulting from an increase of M does not lead, differently from what occurs with simple PPR, to overfitting the training data and consequently to a reduced prediction capability.
3. The effect of Bagging is more evident in the presence of moderate-high noise.
In this work we have considered three well-known real data sets, previously used in other studies and included in the machine learning repository of the University of California, Irvine: the Boston Housing data, with 506 observations, one for each census district of the Boston metropolitan area, the median market value of owner-occupied homes as the response variable and 12 socio-demographic and environmental indicators as predictor variables; the Auto-MPG data, concerning the city-cycle fuel consumption in miles per gallon of 398 cars, with 7 predictor variables related to mechanical characteristics, year and origin of production; the Abalone data, which includes the age of 4177 examples of abalone (the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope) and 8 different measurements of dimension and weight, which are easier to obtain and are used to predict the age. To estimate PE(g_PPR), PE(g_PPRbagg) and PE(g_PPRboost) we applied the simulation procedure illustrated in Fig. 1.

[Fig. 1. Simulation schema for the real data-set Auto-MPG: random split of the empirical data-set into a training sample (80% of units) and a test sample (20% of units); estimation of the methods PPR, PPRbagg and PPRboost; calculation of the MSE on the training sample and of the PE on the test sample for each method.]

In the first step we randomly split the real data set in two parts: the training set (including, in accordance with the training sizes used in the literature, 80% of the cases for the Auto-MPG data-set, 90% for Boston Housing and 75% for Abalone) and the test set (including the remaining cases). From the training data we estimate g_PPR and the aggregated predictors g_PPRbagg and g_PPRboost obtained from the two different resampling techniques. Correspondingly, the values of the mean squared error (MSE) are calculated on the training data set. Then we calculate the prediction error PE on the test set for each method. This schema is repeated 50 times with different random splits of the data-set, and finally we obtain three estimates of the average mean-squared generalization error PE. The simulation procedure is replicated for several values of M; hence the effect of several degrees of complexity of PPR is considered for both Bagging and Boosting. A minimal code sketch of this schema is given below.
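The following Python sketch of the schema in Fig. 1 uses a generic regressor as a stand-in for the PPR, PPRbagg and PPRboost estimators (the study itself estimates PPR with SMART); the default test_size of 0.2 matches the Auto-MPG split, while 0.1 and 0.25 would be used for Boston Housing and Abalone, and the function name simulation_schema is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor  # stand-in for PPR / PPRbagg / PPRboost

def simulation_schema(X, y, fit_fn, n_repetitions=50, test_size=0.2, seed=0):
    """Repeat the random split, record the training MSE and the test PE, average over repetitions."""
    mse_train, pe_test = [], []
    for r in range(n_repetitions):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed + r)
        model = fit_fn(X_tr, y_tr)                    # estimate the method on the training sample
        mse_train.append(mean_squared_error(y_tr, model.predict(X_tr)))
        pe_test.append(mean_squared_error(y_te, model.predict(X_te)))
    return np.mean(mse_train), np.mean(pe_test), np.var(pe_test)

# Example with the simple predictor; bagged or boosted estimators would be passed through fit_fn too.
# mse, pe, pe_var = simulation_schema(X, y, lambda X, y: DecisionTreeRegressor().fit(X, y))
```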

In Figures 2 to 5 we show the results of the simulations for the Auto-MPG data-set with M = 4. For each of the 50 repetitions of the simulation schema, the MSE estimated on the training set and the PE estimated on the test set are shown as a function of the number of aggregations K (from 1 to 30 for Bagging, from 1 to 50 for Boosting). We can note that the average PE (line in bold type) decreases as the number of aggregations rises, and a reduction in the sample variability is evident for both Bagging and Boosting, although to a different extent. Moreover, the average PE already becomes stable with 30 aggregations using Bagging, while with Boosting it is necessary to consider a larger number K (50) to obtain a good predictor.

[Fig. 2. Fit (MSE versus the number of iterations) of PPR-bagg on the training sample (data-set Auto MPG, M = 4).]

In Figures 6 and 7 we consider the Boston Housing data-set with M = 4 and we show the behaviour of the average PE and the average MSE as the number of aggregations increases, for the predictors PPRbagg and PPRboost. The horizontal lines on the graphs represent the average PE and the average MSE for the simple PPR. In Tables 1-2 we compare, for the three methods, the estimates of PE and MSE (with K = 30 for Bagging and K = 50 for Boosting) for the Auto-MPG data set and several values of M. In the tables we also report the variance of the values of PE and MSE obtained by the simulations, to compare the variability of the results obtained by the three different methods.
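To reproduce curves of the kind shown in the figures, one can track the prediction error of the aggregated predictor on the test sample as the number of aggregations grows; the sketch below does so for a bagged ensemble, reusing the illustrative fit_bagging function of the earlier sketch (assumed names, not the authors' code).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def pe_versus_K(models, X_test, y_test):
    """Test-set PE of the bagging aggregate built from the first k models only, for k = 1..K."""
    preds = np.column_stack([g.predict(X_test) for g in models])             # shape (n_test, K)
    running_mean = np.cumsum(preds, axis=1) / np.arange(1, preds.shape[1] + 1)
    return [mean_squared_error(y_test, running_mean[:, k]) for k in range(preds.shape[1])]

# pe_curve = pe_versus_K(fit_bagging(X_train, y_train, K=30), X_test, y_test)
# The curve typically decreases and then flattens as K grows, as described for Figures 2-5.
```

For Boosting, the analogous curve would use the weighted-median aggregation restricted to the first k models.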

[Fig. 3. Prediction error (PE versus the number of iterations) of PPR-bagg on the test sample (data-set Auto MPG, M = 4).]

[Fig. 4. Fit (MSE versus the number of iterations) of PPR-boost on the training sample (data-set Auto MPG, M = 4).]

We can note that for the simple PPR the prediction capability, measured by the value of PE, decreases as the complexity parameter M grows, while for PPRbagg and PPRboost the prediction capability increases, with lower variability. The behaviour of the values of MSE and PE for PPR points out that this predictor tends to overfit the data. Such a disadvantage does not occur, at least for the values of M considered, with PPRbagg and PPRboost, which indeed benefit from the increase in the number of parameters of the model.

[Fig. 5. Prediction error (PE versus the number of iterations) of PPR-boost on the test sample (data-set Auto MPG, M = 4).]

[Fig. 6. Mean values of PE (test sample) and MSE (training sample) for PPR and PPR-bagg as a function of the number of iterations (data-set Boston Housing, M = 4).]

In Tables 3 and 4 we show the results of the same analysis applied to the Boston Housing data-set. Also in this case, increasing the value of M, the fit to the training sample improves for all three predictors, even if PPRboost obtains the best results in terms of MSE and variability. With respect to the prediction capability, PPR seems to reach overfitting for M = 8. Conversely, PPRboost and PPRbagg improve their forecasts, and the latter achieves the best performance.

[Fig. 7. Mean values of PE (test sample) and MSE (training sample) for PPR and PPR-boost as a function of the number of iterations (data-set Boston Housing, M = 4).]

The results of Tables 5 and 6 refer to the Abalone data-set, which differs from the previous two data sets in its greater number of cases. From the tables we can note that, as M grows, the fit of PPR on the training sample improves but without overfitting. Also for this data set, PPRbagg and PPRboost turn out better than PPR in both fitting and prediction capability. From the point of view of the computational complexity of the simulations carried out, for each data-set we applied PPR 6000 times to obtain the estimate of PE for PPRbagg (50 repetitions, K = 30 aggregations, 4 values of M) and 10000 times to obtain the estimate of PE for PPRboost (50 repetitions, K = 50, 4 values of M). Using the SMART program to estimate PPR, we fixed the internal parameters in such a way that the algorithm did not carry out pruning strategies.

[Table 1. Estimation of PE on Auto-MPG for several values of M: for each of the four values of M, the table reports PE and its variance for PPR, PPR-bagg and PPR-boost.]

[Table 2. Estimation of MSE on Auto-MPG for several values of M: for each of the four values of M, the table reports MSE and its variance for PPR, PPR-bagg and PPR-boost.]

[Table 3. Estimation of PE on Boston Housing for several values of M: same layout as Table 1.]

[Table 4. Estimation of MSE on Boston Housing for several values of M: same layout as Table 2.]

[Table 5. Estimation of PE on Abalone for several values of M: same layout as Table 1.]

[Table 6. Estimation of MSE on Abalone for several values of M: same layout as Table 2.]

4. Conclusions

The results obtained by the simulation studies on real data sets confirm the effectiveness of Bagging when applied in conjunction with nonparametric regression methods like PPR. This resampling technique improves both the fit to the training sample and the prediction capability of PPR, avoiding problems connected with overfitting, even in the presence of a large number of parameters in the model. The results obtained bring us to a positive evaluation of the method, even if the number of replications is not sufficient to obtain very accurate estimates.

The comparison between Boosting and Bagging does not show remarkable differences, although the latter method gives, for our data sets, better performance in the prediction phase. The prediction capability of Boosting was never remarkably better than that of Bagging, although its fit to the training sample was usually the best. These results differ from other comparisons obtained using Regression Trees (Drucker, 1997). However, it is necessary to consider that we used the boosting approach of Drucker, while there are also other proposals in the literature for the application of Boosting to regression methods. Therefore we cannot exclude that better results may be obtained considering another proposal. Alternative approaches were proposed, for example, by Ridgeway et al. (1999), difficult to implement from a computational point of view, or by Friedman (1999), based on gradient descent.

With regard to the comparison with other analyses using the same data-sets, it can be noticed that the application of PPR with Bagging proved in many cases more effective than Regression Trees with Bagging (see Breiman 1999 as an example).

In the light of the results obtained, it can be concluded that, both for real data and for simulated data (Borra & Di Ciaccio, 2001), the application of Bagging to Projection Pursuit Regression proved to be effective both in the training phase and in the prediction phase.

REFERENCES

Bauer, E. and Kohavi, R. (1998) An empirical comparison of voting classification algorithms: Bagging, Boosting, and variants, Machine Learning.
Borra, S. and Di Ciaccio, A. (2001) Reduction of prediction error by bagging projection pursuit regression, in Advances in Classification and Data Analysis, Borra S., Rocci R., Vichi M., Schader M. (eds.), Springer-Verlag.
Breiman, L. (1996) Bagging predictors, Machine Learning, 26(2).
Breiman, L. (1998a) Arcing classifiers (with discussion), Annals of Statistics, 26.
Breiman, L. (1998b) Half and half bagging and hard boundary points, Technical Report n. 534, Statistics Department, University of California, Berkeley.
Breiman, L. (1999) Using adaptive bagging to debias regressions, Technical Report n. 547, Statistics Department, University of California, Berkeley.
Diaconis, P. and Shahshahani, M. (1984) On nonlinear functions of linear combinations, SIAM Journal on Scientific and Statistical Computing, 5.
Dietterich, T. G. (1997) Machine learning research: four current directions, AI Magazine, 18(4).
Dietterich, T. G. (1998) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, Boosting, and Randomization, Machine Learning.
Drucker, H. (1997) Improving regressors using boosting techniques, in Proceedings of the Fourteenth International Conference on Machine Learning, Fisher D.H. Jr. (ed.), Morgan Kaufmann.
Drucker, H. and Cortes, C. (1996) Boosting decision trees, Advances in Neural Information Processing Systems, 8.
Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap, Chapman & Hall, New York.
Freund, Y. and Schapire, R. (1996) Experiments with a new boosting algorithm, in Machine Learning: Proceedings of the Thirteenth International Conference.
Freund, Y. and Schapire, R. (1997) A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences, 55(1).
Friedman, J.H. (1985) Classification and multiple regression through projection pursuit, Technical Report n. 12, Dept. of Statistics, Stanford University.

Friedman, J.H. (1991) Multivariate adaptive regression splines, The Annals of Statistics, 19.
Friedman, J.H. (1999) Greedy function approximation: a gradient boosting machine, Technical Report, Dept. of Statistics, Stanford University, http://www-stat.stanford.edu/~jhf/.
Friedman, J.H. and Hall, P. (1999) On bagging and nonlinear estimation, http://www-stat.stanford.edu/~jhf/.
Friedman, J.H. and Stuetzle, W. (1981) Projection pursuit regression, Journal of the American Statistical Association, 76.
Quinlan, J. R. (1996) Bagging, boosting, and C4.5, in Proceedings of the Thirteenth National Conference on Artificial Intelligence.
Ridgeway, G., Madigan, D. and Richardson, T. (1999) Boosting methodology for regression problems, in Proceedings of the 7th International Workshop on Artificial Intelligence and Statistics, January 3-6, Florida.
Schwenk, H. and Bengio, Y. (1998) Training methods for adaptive boosting of neural networks for character recognition, in Advances in Neural Information Processing Systems, 10.

Performance evaluation of Bagging and Boosting in nonparametric regression

Summary

In this paper we consider two recently proposed techniques, Bagging and Boosting, devoted to the reduction of prediction error in classification and regression problems. In the analysis of artificial and real data sets they proved to be particularly effective when combined with Classification and Regression Trees and with Neural Networks. In this paper we test and compare these techniques combined with Projection Pursuit Regression. The results obtained in the analysis of simulations on real data sets show the capability of these techniques to reduce the prediction error of Projection Pursuit Regression. In these simulations Bagging performed better than Boosting, contrary to what was observed with other nonparametric methods.

Performance evaluation of Bagging and Boosting in nonparametric regression (Riassunto)

In this work we consider two techniques recently proposed in the literature, Bagging and Boosting, aimed at reducing the prediction error in classification and regression problems. In studies carried out on simulated or real data, these techniques proved to be particularly effective in conjunction with Classification and Regression Trees and with Neural Networks. In this work we test and compare these techniques in conjunction with Projection Pursuit Regression.

The results obtained from simulations on real data show their capability to reduce the prediction error of Projection Pursuit Regression. In the simulations carried out, Bagging turns out to be more effective than Boosting, contrary to what has been observed in conjunction with other nonparametric methods.

Key words: Nonparametric regression; Bagging; Boosting; Projection pursuit regression; Prediction error; Pruning.

[Manuscript received October 2000; final version received March 2001.]
