Computational Statistics Canonical Forest


Computational Statistics
Canonical Forest --Manuscript Draft--

Manuscript Number: COST-D R
Full Title: Canonical Forest
Article Type: Original Paper
Keywords: Canonical linear discriminant analysis; Classification; Ensemble; Linear discriminant analysis; Rotation Forest
Corresponding Author: Hongshik Ahn, Ph.D., SUNY Stony Brook, Stony Brook, NY, UNITED STATES
Corresponding Author's Institution: SUNY Stony Brook
First Author: Yu-Chuan Chen
Order of Authors: Yu-Chuan Chen; Hyejung Ha; Hyunjoong Kim, Ph.D.; Hongshik Ahn, Ph.D.

Abstract: We propose a new classification ensemble method named Canonical Forest. The new method uses canonical linear discriminant analysis (CLDA) and bootstrapping to obtain accurate and diverse classifiers that constitute an ensemble. We note that CLDA serves as a linear transformation tool rather than a dimension reduction tool. Since CLDA finds the transformed space that separates the classes farther in distribution, classifiers built in this space will be more accurate than those built in the original space. To further facilitate the diversity of the classifiers in an ensemble, CLDA is applied only to a partial feature space for each bootstrapped data set. To compare the performance of Canonical Forest with other widely used ensemble methods, we tested them on real and artificial data sets. Canonical Forest performed significantly better in accuracy than the other ensemble methods on most data sets. An investigation of the bias and variance decomposition shows that the success of Canonical Forest can be attributed to variance reduction.

Authors' Response to Reviewers' Comments (CS_response_00.docx)

Response to the reviewers

Title: Canonical Forest

We appreciate your review of our manuscript. We have revised the manuscript accordingly. Our responses to the reviewers' comments are given below. The page numbers and line numbers refer to the revised manuscript.

Response to the comments of the Co-editor

1. At the end of your introduction, please be specific and indicate in which of the following sections the listed content will be mentioned. Also mention here that the pseudo-code can be found in the two appendices.
Response: We have added a brief introduction of each section in the last paragraph of Section 1.

2. In addition to the figure mentioned by the reviewer (these are far too small!), I think that Fig. 1 also could deserve slightly larger labels for both axes.
Response: We have enlarged the labels for both axes in Fig. 1.

3. When you follow the comment from the reviewer, it may be a good idea to consider the interval (0, 1) for kappa and perhaps (0, 0.) for the error rates.
Response: We have fixed the interval (0, 1) for kappa and (0, 0.) for the error rates throughout the plots.

4. When you redo this plot, correct "Lengend" to "Legend" and enlarge the font size a little bit. Moreover, you can reduce the spacing between the figures via the par() command in R.
Response: We have corrected this typo and also enlarged the font size.

Response to the comments of the Reviewer

1. Can you increase the size of the labels and axis labels in the figure? The font size is too small.
Response: We have enlarged both the labels and the axes in the figure.

2. In the figure, how about changing the range on the x-axis and y-axis to be the same throughout the plots so that we can compare across the data sets?
Response: We have fixed the interval (0, 1) for kappa and (0, 0.) for the error rates throughout the plots.

Computational Statistics manuscript No. (will be inserted by the editor)

Canonical Forest

Yu-Chuan Chen · Hyejung Ha · Hyunjoong Kim · Hongshik Ahn

Abstract We propose a new classification ensemble method named Canonical Forest. The new method uses canonical linear discriminant analysis (CLDA) and bootstrapping to obtain accurate and diverse classifiers that constitute an ensemble. We note that CLDA serves as a linear transformation tool rather than a dimension reduction tool. Since CLDA finds the transformed space that separates the classes farther in distribution, classifiers built in this space will be more accurate than those built in the original space. To further facilitate the diversity of the classifiers in an ensemble, CLDA is applied only to a partial feature space for each bootstrapped data set. To compare the performance of Canonical Forest with other widely used ensemble methods, we tested them on real and artificial data sets. Canonical Forest performed significantly better in accuracy than the other ensemble methods on most data sets. An investigation of the bias and variance decomposition shows that the success of Canonical Forest can be attributed to variance reduction.

Keywords Canonical linear discriminant analysis · Classification · Ensemble · Linear discriminant analysis · Rotation Forest

1 Introduction

Classification is a procedure to build a model using the class labels and their features (predictor variables) in a training data set and then to predict the class labels of future instances. Popular classification techniques include decision trees, logistic regression, neural networks, support vector machines, and linear discriminant analysis. The concept of combining many classifiers, known as a classification ensemble, was proposed to improve the overall classification accuracy.
A classification ensemble summarizes the predictions from multiple classifiers through majority voting or weighted voting to form the final prediction. It is known that a classification ensemble consisting of several weak classifiers can improve the classification accuracy (Ji and Ma 1997; Hastie et al. 2001). Here, a weak classifier is defined as one that performs only slightly better than random guessing. The accuracy of a classification ensemble can be improved further by diversifying the classifiers that constitute the ensemble. Two ensemble methods, Boosting (Schapire 1990; Freund and Schapire 1996, 1997) and Bagging (Breiman 1996), have been widely used for this purpose. More recently, Rotation Forest (Rodríguez et al. 2006) was proposed to further diversify the classifiers in an ensemble by using principal component analysis. All these methods use multiple re-sampled or re-weighted training data sets to build diverse classifiers, as described below.

Yu-Chuan Chen · Hongshik Ahn ( ) Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY; SUNY Korea, Incheon, South Korea. hongshik.ahn@stonybrook.edu
Hyejung Ha · Hyunjoong Kim, Department of Applied Statistics, Yonsei University, Seoul, South Korea

Boosting changes the distribution of the training data set during the construction of the ensemble, then combines the classifiers using weighted voting; higher weights are assigned to classifiers with better classification accuracy. Adaboost (Freund and Schapire 1996, 1997) is the most widely used boosting method. Samme (Zhu et al. 2009) is a newer algorithm that naturally extends Adaboost to the multiclass case without reducing it to multiple two-class problems. Recently, a boosting algorithm specifically designed for asymmetrically mislabeled data, named Asymmetric η-boost, was proposed by Hayashi (2012). Asymmetric η-boost generalizes Adaboost by adopting the asymmetric η-loss function, making the boosting algorithm more resistant to asymmetric mislabeling. Bagging uses bootstrapped training data at each stage of the ensemble to build diverse classifiers; simple majority voting combines the fitted classifiers to form the final classification. Random Forest (Breiman 2001), a variant of Bagging, is a decision tree based ensemble method. It randomly uses a feature subset, instead of all the features, to find a split when growing a decision tree. Denote by p the number of all features. The square root of p is the default feature subset size at a node in the R package randomForest, and Ahn et al. (2007) observed that this default gives consistently good results on many data sets. Double-Bagging (Hothorn and Lausen 2003) is a CLDA based ensemble method that combines the algorithms of CLDA and Bagging. For each base classifier, Double-Bagging first takes a bootstrap sample from the training data and then applies CLDA to the out-of-bag (OOB) sample to obtain a coefficient matrix. The canonical features are computed from this coefficient matrix in the bootstrap sample, and the base classifier is built using both the original features and the canonical features. Finally, simple majority voting combines the base classifiers. Like Bagging and Random Forest, Rotation Forest (Rodríguez et al. 2006) fits all the classifiers in parallel.
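The bootstrap-plus-majority-vote core that Bagging contributes to all of these parallel ensembles can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the one-threshold "stump" base learner and all function names are hypothetical.

```python
import random
from collections import Counter

def bagging_predict(train, test_point, fit, B=25, seed=0):
    """Fit B classifiers on bootstrap samples of the training data and
    combine their predictions by simple majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(B):
        # bootstrap: draw n points with replacement from the training data
        boot = [rng.choice(train) for _ in train]
        votes.append(fit(boot)(test_point))
    return Counter(votes).most_common(1)[0][0]

def fit_stump(data):
    """A weak base learner: a single threshold on one feature, predicting
    the majority class on each side of the threshold."""
    t = sum(x for x, _ in data) / len(data)
    left = Counter(y for x, y in data if x <= t)
    right = Counter(y for x, y in data if x > t)
    l = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    r = right.most_common(1)[0][0] if right else l
    return lambda x: l if x <= t else r
```

Each individual stump is only a weak classifier, but the majority vote over B bootstrap replicates is far more stable, which is the point made above.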
To fit a classifier in the ensemble, it randomly splits the training data of p features into K subsets with roughly m = p/K features each. Principal component analysis (PCA) is applied to each of the K subsets using the bootstrapped data within the subset. A rotation matrix containing the coefficients obtained from PCA is then multiplied with the original training data to obtain a rotated training data set, and a classifier is fitted to the rotated data. This process is repeated to constitute an ensemble, and majority voting combines the fitted classifiers. Kestler et al. (2011) proposed an ensemble method that combines many simple threshold classifiers named rays (Anthony and Biggs 1992) using two different voting schemes: majority voting and conjunction (unanimous voting). They also use two different approaches to generate the rays: univariate feature selection evaluated by the area under the ROC curve (AUC), and greedy feature selection. In this study, we develop a new ensemble method called Canonical Forest, which combines its classifiers by majority vote. We explain the algorithm of Canonical Forest in Section 2 and provide the pseudocode in the Appendices. In Section 3, we compare Canonical Forest with current popular ensemble methods including Bagging, Random Forest, Adaboost, Samme, and Rotation Forest. In Section 4, we investigate the bias and variance decomposition for each method. Section 5 provides a discussion.

2 Method

2.1 Feature Extraction and CLDA

For high-dimensional data with redundant information, the data can be transformed into a reduced set of features; this is called feature extraction. If the feature extraction is successful, the set of new features contains most of the information in the data set. PCA is the most widely used method of feature extraction.
However, when the data consist of several classes, canonical linear discriminant analysis (CLDA) is a better approach because PCA does not utilize the class information. The general idea of CLDA is to find a linear combination of features that maximizes the between-class variance relative to the within-class variance (Hastie et al. 2001). That is, it finds a linear transformation of the features that separates the class distributions as far as possible. In this sense, classification using the features extracted by CLDA performs better than classification using the original features. The algorithm of CLDA, adapted from Hastie et al. (2001), is given in Appendix 1.

2.2 Canonical Forest

The main difference between Canonical Forest and Rotation Forest (Rodríguez et al. 2006) lies in the method of feature extraction: CLDA is used for Canonical Forest, while PCA is used for Rotation Forest.

Let x = [x_1, ..., x_p]^T be an instance represented by p features and let X be the training data composed of n instances, in the form of an n × p matrix. Let Y = [y_1, ..., y_n]^T be the class labels (1, ..., C) of the training data. The classifiers in the ensemble are denoted by L_1, ..., L_B and the feature set is denoted by F. To set up the training data for classifier L_i, we use the following steps.

1. Randomly split F into K subsets. The subsets are made disjoint to enhance the diversity of the ensemble. For simplicity, assume that p is divisible by K; then each feature subset contains m = p/K features. If p is not divisible by K, we let m = floor(p/K) + 1 and the last subset contains the remaining features.
2. Denote by F_{i,j} the jth subset of features of the training data for classifier L_i. For each such subset, use bootstrap resampling with a fixed proportion of the original sample size. Run CLDA on F_{i,j} and obtain a coefficient matrix A_{i,j} = [a_{i,j}^(1), a_{i,j}^(2), ..., a_{i,j}^(m)] of size m × m. Note that each a_{i,j}^(k) has length m because all the canonical components are kept, without dimension reduction.
3. Arrange the obtained coefficient matrices A_{i,1}, ..., A_{i,K} into a p × p block diagonal matrix R_i. Construct the rotation matrix R_i^a (a p × p matrix) by rearranging the rows of R_i so that they correspond to the original features in F.
4. The new training data for classifier L_i is (X R_i^a, Y).

The pseudocode for Canonical Forest is given in Appendix 2. Since the centroids of the C classes in the p-dimensional input space span at most a (C − 1)-dimensional subspace, it is common to extract C − 1 features in CLDA for dimension reduction. However, we use CLDA as a linear transformation tool rather than a dimension reduction tool in this study. Therefore, we set the number of extracted features equal to the number of original features.
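Steps 1 and 3 above, the random split into disjoint feature subsets and the assembly of the row-rearranged block-diagonal rotation matrix, can be sketched as follows. This is an illustrative reconstruction of those definitions, not the authors' code; plain nested lists stand in for matrices, and the coefficient blocks are taken as given.

```python
import random

def split_features(p, K, seed=0):
    """Step 1: randomly partition feature indices 0..p-1 into disjoint
    subsets.  If p is divisible by K each subset has m = p/K features;
    otherwise m = floor(p/K) + 1 and the last subset takes the remainder."""
    idx = list(range(p))
    random.Random(seed).shuffle(idx)
    m = p // K if p % K == 0 else p // K + 1
    return [idx[j:j + m] for j in range(0, p, m)]

def rotation_matrix(subsets, coef_blocks, p):
    """Step 3: place each subset's m x m coefficient block on the diagonal,
    writing its rows directly at the original feature indices so the result
    is already the rearranged rotation matrix R_i^a."""
    R = [[0.0] * p for _ in range(p)]
    col0 = 0
    for subset, A in zip(subsets, coef_blocks):
        for a, feat in enumerate(subset):
            for b in range(len(subset)):
                R[feat][col0 + b] = A[a][b]   # row = original feature index
        col0 += len(subset)
    return R
```

With identity coefficient blocks the result is a pure permutation of the features, which makes the row-rearrangement easy to check by hand.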
This results in m extracted features in each subset of features, even if m is greater than C − 1. Although these extra features may not contribute much to the discriminatory power, they encourage the classifiers to be more diverse, which can yield higher ensemble accuracy.

3 Results

An experiment was conducted using real and artificial data sets to compare Canonical Forest with Bagging, Adaboost, Samme, Random Forest, and Rotation Forest. Decision trees were used as the base classifiers for all the ensemble methods. We used unpruned decision trees as base classifiers, except for Adaboost and Samme, for which we set the maximum depth of each tree to the number of classes, the default setting of adaboost.m1 in R. The decision tree program rpart, available in the R library, was used for the experiment; rpart is based on the CART (Classification and Regression Trees) algorithm (Breiman et al. 1984). In Random Forest, we set the number of features chosen at each node to the square root of the total number of features, the default setting in the R package randomForest. In Canonical Forest and Rotation Forest, the number of features in each subset was set to m = 3; if p is not divisible by 3, the last subset was completed with the remaining features (1 or 2 features). The data are summarized in Table 1. Most of the data sets come from the UCI Machine Learning Repository (Asuncion and Newman 2007) and the R package mlbench (Leisch and Dimitriadou). Since PCA and CLDA cannot be applied to discrete features, all discrete features were removed from each data set. To compare the accuracy of the ensemble methods over a range of ensemble sizes, twenty repetitions of cross-validation were performed for each data set. A paired t-test was used for each data set to measure the statistical significance of the difference between Canonical Forest and each of the other methods.
Figure 1 shows boxplots comparing Canonical Forest with the other classification methods using the statistics from the paired t-tests. A positive paired t-test statistic indicates that Canonical Forest performs better. Figure 1 shows that the t-test statistic tends to increase with ensemble size for a single tree and for Bagging, tends to decrease with ensemble size for Random Forest, and shows no obvious trend for Adaboost, Samme, and Rotation Forest. In general, Canonical Forest performs better than the other ensemble methods at each ensemble size.
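The paired t statistic behind these comparisons is straightforward to compute from per-fold (or per-repetition) accuracies. A minimal sketch with hypothetical accuracy vectors; following the sign convention of Figure 1, a positive value favors the first method.

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic for matched accuracy measurements of two methods.
    t = mean(d) / sqrt(var(d) / n), where d are the paired differences."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

The statistic is compared against a t distribution with n − 1 degrees of freedom to obtain a p-value.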

Table 1 Description of the data sets, by source (the numbers of observations, continuous features, and classes are listed per set in the original table):
UCI: aba, aus, bld, bos, ech, hea, ion, iri, led, pid, spe, veh, vow
R library: cir, mam, pks, rng, snr, trn, twn
Heinz et al. (2003): bod
Loh (2009): dia, lak, vol
Kim and Loh (2003): fis
Kim et al. (2011): int
Kim and Loh (2001): pov
Terhune (1994): sea
Statlib: usn

To compare the accuracies of the methods at large ensemble sizes, we fixed the ensemble size at two values, the larger being B = 500. The average accuracies are shown in Tables 2 and 3. The comparisons were made using a paired t-test at a two-sided significance level of α = 0.05. We also compared the performance of LDA with that of Canonical Forest and found Canonical Forest to be significantly better than LDA on most of the data sets. Since our focus in this paper is to show that Canonical Forest is competitive with other current popular ensemble methods, we did not include LDA in Tables 2 and 3, to keep the tables concise. The single tree is included in Tables 2 and 3 as a reference, because trees are the base classifiers in all the ensemble methods. Tables 4 and 5 summarize the comparisons given in Tables 2 and 3, respectively. The entry a_{ij} shows the frequency with which the method in column j is more accurate than the method in row i, and the number in parentheses shows how often these differences are statistically significant. In Table 4, for example, the entry in the Bagging row and the Canonical Forest column gives the number of data sets on which Canonical Forest was more accurate than Bagging, with the parenthesized number counting the data sets on which it was significantly better.
Conversely, the entry in the Canonical Forest row and the Bagging column counts the remaining cases, in which Bagging was significantly better than Canonical Forest. Tables 6 and 7 present rankings of the methods based on the frequency with which each method was significantly more accurate and significantly less accurate than the others. In Table 6, for example, the number of wins for Canonical Forest is the sum of the parenthesized numbers in the Canonical Forest column of Table 4; the number of losses is the sum of the parenthesized numbers in the Canonical Forest row; and the dominance rank is wins minus losses. Tables 2 through 7 show that Canonical Forest outperformed the other widely used ensemble methods: its dominance rank was substantially larger than that of the second-best method at both ensemble sizes. To confirm that the superiority of Canonical Forest is not just by chance, we performed an exact binomial test on the classification accuracies of Tables 2 and 3. A one-sided test was adopted to obtain the p-values for the

Table 2 Classification accuracy of the single tree and the ensemble methods (Bagging, Adaboost, Samme, Random Forest, Rotation Forest) compared with Canonical Forest on the 29 data sets at the smaller ensemble size, with win/tie/loss counts for Canonical Forest against each method. + Canonical Forest significantly better; - Canonical Forest significantly worse; significance level 0.05.

alternative hypothesis that Canonical Forest performs better than another method. Table 8 shows the p-values of the exact binomial tests. We used the Holm-Bonferroni correction (Holm 1979) to adjust the significance level of the multiple tests and maintain an overall type I error rate of α = 0.05. With six comparisons, the adjusted significance levels, from smallest to largest, are α/6 ≈ 0.0083, α/5 = 0.0100, α/4 = 0.0125, α/3 ≈ 0.0167, α/2 = 0.0250, and α = 0.0500. The p-values, arranged in increasing order, are compared with these adjusted values, proceeding sequentially from the smallest p-value. As a result, at the smaller ensemble size, Canonical Forest is significantly more accurate than all the other methods. At B = 500, Canonical Forest is also significantly more accurate than the other methods except Samme; the p-value for the comparison with Samme, however, is quite close to the adjusted significance level. We also conducted an experiment investigating the sample size as a factor in the gain of Canonical Forest over the other methods, using eight randomly selected data sets: aus, bos, dia, pid, snr, trn, vol, and vow. For each data set, bootstrap sampling was used to generate each sample size, with ten repetitions per sample size averaged. No obvious trends were found when comparing Canonical Forest with Tree, Bagging, Random Forest, and Rotation Forest.
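Holm's step-down procedure described above can be sketched directly; the p-values below are hypothetical, not those of Table 8.

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value with
    alpha/(m-k) for k = 0, 1, ..., and stop at the first non-rejection.
    Returns booleans in the original order (True = null rejected)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break   # all remaining (larger) p-values fail as well
    return reject
```

With six comparisons the first threshold is α/6 ≈ 0.0083, rising to α for the largest p-value, exactly the sequence of adjusted levels used in the text.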
However, when comparing Canonical Forest with Adaboost and Samme, the gain of Canonical Forest over each of these two methods tends to decrease as the sample size increases for most of these data sets. In addition to accuracy, we investigated the diversity of each ensemble method. The kappa statistic κ (Cohen 1960) measures the agreement between two categorical variables; the agreement between two classifiers L_i and L_j can thus be measured by κ (Kuncheva and Whitaker 2003; Rodríguez et al. 2006). An ensemble of B classifiers yields B(B − 1)/2 pairs of classifiers (L_i, L_j). Higher diversity among the classifiers in an ensemble produces smaller κ values. Figure 2 shows κ-error diagrams for selected data sets; the diagrams for the other data sets are similar to one of these plots. In each κ-error diagram, the x-axis is the kappa statistic κ and the y-axis is the averaged error of L_i and L_j, denoted E_{i,j} = (E_i + E_j)/2, where E_i and E_j are the error rates of L_i and L_j, respectively. Since small values of κ indicate that the classifiers are more diverse and small values of E_{i,j}

Table 3 Classification accuracy of the single tree and the ensemble methods compared with Canonical Forest on the 29 data sets at ensemble size B = 500, with win/tie/loss counts for Canonical Forest against each method. + Canonical Forest significantly better; - Canonical Forest significantly worse. Some features of data set hea were removed due to computational problems.

Table 4 Summary of comparisons among the methods (Tree, Bagging, Adaboost, Samme, Random Forest, Rotation Forest, Canonical Forest) at the smaller ensemble size. The entry a_{ij} shows the frequency that the method in column j is more accurate than the method in row i; the number in parentheses shows the frequency that these differences are statistically significant.

indicate a high accuracy, the ideal location of the dots is the lower left corner. Five hundred classifiers were fitted using the training data, and the results on the test data were used to calculate the κ statistics and the error rates. The κ-error diagram shows the relative position of the ensemble methods for each data set. To help read Figure 2, take data set aus as an example: Samme has the largest diversity but also the highest error rate, while Canonical Forest has the lowest error rate (its points lie lower on the y-axis) but is slightly less diverse (its points lie farther right on the x-axis) than the other four ensemble methods. Figure 3 shows contour plots of the κ-error diagrams for six representative data sets; we used contour plots to make the differences clearer.
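The quantities plotted in each κ-error diagram, Cohen's κ and the averaged error per pair of classifiers, can be computed as follows. A minimal sketch from the definitions above; the function names are ours, not from the paper.

```python
from itertools import combinations

def kappa(labels_a, labels_b, classes):
    """Cohen's kappa between the predictions of two classifiers:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    chance = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    return (observed - chance) / (1 - chance)

def kappa_error_points(predictions, truth, classes):
    """One (kappa, averaged error E_ij) point per pair of classifiers;
    an ensemble of B classifiers yields B*(B-1)/2 such points."""
    errors = [sum(p != t for p, t in zip(pred, truth)) / len(truth)
              for pred in predictions]
    return [(kappa(predictions[i], predictions[j], classes),
             (errors[i] + errors[j]) / 2)
            for i, j in combinations(range(len(predictions)), 2)]
```

Identical classifiers give κ = 1 and perfectly disagreeing ones give κ = −1, so a diverse, accurate ensemble populates the lower left of the diagram.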
Taking data set fis as an example, Canonical Forest lies toward the lower right corner because of its lower error rate and lower diversity relative to the other ensemble methods, while Samme lies toward the upper left corner because of its higher error rate and higher diversity. Note that Samme is not shown in the plots for bod, bos, and sea because of its very high error rate and diversity. Canonical Forest appears to outperform the other ensemble methods in terms of accuracy, while its diversity is similar to that of Bagging and Rotation Forest and it is less diverse

Table 5 Summary of comparisons among the methods at ensemble size B = 500. The entry a_{ij} shows the frequency that the method in column j is more accurate than the method in row i; the number in parentheses shows the frequency that these differences are statistically significant.

Table 6 Dominance ranks (wins, losses, and wins − losses) of the methods, using the significant differences from Table 4. In decreasing order of dominance rank: Canonical Forest, Rotation Forest, Random Forest, Samme, Bagging, Adaboost, Tree.

Table 7 Dominance ranks of the methods, using the significant differences from Table 5. In decreasing order of dominance rank: Canonical Forest, Random Forest, Rotation Forest, Adaboost, Samme, Bagging, Tree.

Table 8 P-values of the exact binomial tests comparing Canonical Forest with the other methods at both ensemble sizes, with the Holm-Bonferroni adjusted significance levels. The results are significant under the Holm-Bonferroni correction at α = 0.05 except the comparison with Samme at ensemble size 500.

than Samme and Random Forest. This suggests that if we could increase the diversity of Canonical Forest, it could be an even more powerful ensemble method.
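The dominance-rank bookkeeping described above (wins = column sums of the significant counts, losses = row sums, rank = wins minus losses) can be sketched directly. The significance matrix below is a hypothetical 3-method example, not data from the paper.

```python
def dominance_ranks(methods, sig):
    """sig[i][j] = number of data sets on which method j is significantly
    more accurate than method i (the parenthesized counts in the summary
    tables).  Returns each method's dominance rank = wins - losses."""
    n = len(methods)
    ranks = {}
    for j, m in enumerate(methods):
        wins = sum(sig[i][j] for i in range(n))     # column sum
        losses = sum(sig[j][k] for k in range(n))   # row sum
        ranks[m] = wins - losses
    return ranks
```

Sorting the methods by this rank reproduces orderings of the kind reported in the dominance tables.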

Fig. 1 Boxplots of paired t-test statistics between Canonical Forest and the other methods. The x-axis is the ensemble size B, and the y-axis is the paired t-test statistic. CanF stands for Canonical Forest, RnF for Random Forest, and RotF for Rotation Forest.

Fig. 2 κ-error diagrams for selected data sets (aus, bos, fis, hea, pks, rng); x-axis = κ, y-axis = E_{i,j} (average error of the pair of classifiers). Methods plotted: Bagging, Samme, Random Forest, Rotation Forest, Canonical Forest.

Fig. 3 Contour plots of κ-error diagrams for selected data sets.

4 Bias and Variance Decomposition

The bias and variance decomposition of the error (Geman et al. 1992) is a popular and useful analysis tool. The bias measures the distance between the prediction of the classifier and the target function, and the variance measures the variation among the predictions of different classifiers. Several authors, including Kong and Dietterich (1995), Kohavi and Wolpert (1996), and Breiman (1998), have proposed methods for the bias and variance decomposition; in this study we used the decomposition proposed by Kohavi and Wolpert (1996). The same data sets as in Section 3 (except data set aus, due to a computational problem) were used. We used cross-validation to estimate the bias and variance, with an ensemble of size 500 for each ensemble method. Table 9 shows the comparison of bias. To read the Win - Loss row in Table 9, take Samme as an example: its entry against Canonical Forest counts the data sets on which the bias of Samme is smaller than that of Canonical Forest versus those on which it is larger. Samme appears to be the best among the ensemble methods considered here at reducing bias, while Canonical Forest is in the middle. Table 10 compares the variance: Canonical Forest dominates all the other methods in reducing variance. In Tables 9 and 10, the p-values were obtained from one-sided Wilcoxon signed-rank tests. We conclude that the high accuracy of Canonical Forest shown in the previous section is mainly due to variance reduction.

Table 9 Comparison of the contribution of bias to the error for the single tree and the ensemble methods on each data set, with Win - Loss counts and p-values against Canonical Forest.
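The Kohavi-Wolpert decomposition can be sketched at the level of a single test point as follows, assuming a deterministic true label (so the irreducible noise term is zero, an assumption of this sketch); the per-point bias² and variance terms are then averaged over the test set.

```python
from collections import Counter

def kohavi_wolpert(predictions, truth):
    """Kohavi-Wolpert (1996) decomposition of 0-1 loss at one test point.
    predictions: labels predicted for this point by classifiers trained on
    different resampled training sets; truth: the true label.
    Returns (bias_squared, variance)."""
    n = len(predictions)
    counts = Counter(predictions)
    labels = set(predictions) | {truth}
    probs = {y: counts[y] / n for y in labels}
    # bias^2 = 0.5 * sum_y (P(Y = y) - P(Yhat = y))^2, with P(Y=y) = 1{y=truth}
    bias_sq = 0.5 * sum(((y == truth) - probs[y]) ** 2 for y in labels)
    # variance = 0.5 * (1 - sum_y P(Yhat = y)^2)
    variance = 0.5 * (1 - sum(q * q for q in probs.values()))
    return bias_sq, variance
```

Classifiers that always agree on the correct label contribute zero bias and zero variance; an even split between the correct and a wrong label contributes equally to both terms.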

Table 10 Comparison of the contribution of variance to the error for the single tree and the ensemble methods on each data set, with Win - Loss counts and p-values against Canonical Forest.

5 Discussion

We introduced a new ensemble method called Canonical Forest. It uses CLDA to perform a linear transformation of the original input data so that the transformed data are more distinct among classes. To enhance diversity, the features are split into subsets, and CLDA is applied to each subset separately. Disjoint subsets yield enhanced diversity because there is no overlap between subsets; the number of subsets, however, is not necessarily related to the diversity of the classifiers in Canonical Forest. A classifier is built using all the canonical components obtained from CLDA. Here CLDA serves as a linear transformation tool rather than a dimension reduction tool, since we keep all the features when applying CLDA. We chose CLDA because it is a supervised learning method: when linearly transforming the data, CLDA utilizes the class information and makes the classes more separable after the transformation. This makes CLDA a better linear transformation tool than other dimension reduction methods for classification analysis. Canonical Forest is formed by combining these classifiers with a majority vote. Although both Double-Bagging and Canonical Forest are CLDA based ensemble methods, they differ in that CLDA is applied to the OOB sample in Double-Bagging but to the bootstrap sample in Canonical Forest. Moreover, CLDA is applied to the whole feature space in Double-Bagging, but to each mutually exclusive feature subset in Canonical Forest.
In our experiments, Canonical Forest performed better in terms of accuracy than the other widely used ensemble methods, especially when the ensemble size was small, and the exact binomial test confirmed its superiority over the other methods. Through the bias and variance decomposition, we found that the reduction of variance played the major role in improving the accuracy of Canonical Forest. The gap in performance between Canonical Forest and the other methods decreased slightly as the ensemble size increased; the κ-error diagrams suggest this is because Canonical Forest is slightly less diverse than the other ensemble methods. Nevertheless, Canonical Forest still showed better classification accuracy when compared with the other methods at an ensemble size of 500, which is nearly optimal for the other methods.

In this experiment, the number of features in each subset was set to m = 3 for both Rotation Forest and Canonical Forest, because this was the parameter setting used in the Rotation Forest experiments (Rodríguez et al. 2006); we used the same setting to make the comparison consistent. We will further investigate the effect of the choice of m on the performance of Canonical Forest in a future study, as Kuncheva and Rodríguez (2007) did for Rotation Forest. It should also be noted that although we removed the discrete features from all the data sets in this experiment, because CLDA and PCA cannot be applied to discrete features, it is acceptable to apply CLDA or PCA to ordinal discrete features by treating them as continuous. Since Canonical Forest is a CLDA based ensemble method, its performance may depend on the performance of CLDA: Canonical Forest will perform better in situations where CLDA performs better. In general, CLDA works better when the discriminatory information lies in the means rather than the variances of the data. CLDA also works better on balanced data (i.e., when the number of instances per class is roughly equal) than on unbalanced data, because it needs representative instances of each class to separate the classes well.

Acknowledgement Hyunjoong Kim's work was partly supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (01R1A1A01). Hongshik Ahn's work was partially supported by the IT Consilience Creative Project through the Ministry of Knowledge Economy, Republic of Korea.

References

Ahn H, Moon H, Fazzari MJ, Lim N, Chen JJ, Kodell RL (2007) Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal
Anthony M, Biggs N (1992) Computational Learning Theory. Cambridge University Press, Cambridge
Asuncion A, Newman DJ (2007) UCI Machine Learning Repository.
University of California, Irvine, School of Information and Computer Science. mlearn/mlrepository.html Breiman L (1) Bagging predictors. Mach Learn :1-0 Breiman L (1) Arcing classifiers. Ann Stat :01- Breiman L (001) Random forest. Mach Learn :- Breiman L, Friedman JH, Olshen RA, Stone CJ (1) Classification and Regression Trees. Wadsworth, Belmont, CA Cohen J (10) A coefficient of agreement for nominal scales. Educ Psychol Meas 0(1): - Freund Y, Schapire R (1) Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp - Freund Y, Schapire R (1) A decision-theoretic generalization of online learning and an application to boosting. J Comput Syst Sci :- Geman S, Bienenstock E, Doursat R (1) Neural networks and the bias/variance dilemma. Neural Comput :1- Hastie T, Tibshirani R, Friedman J (001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, New York Hayashi K (01) A boosting method with asymmetric mislabeling probabilities which depend on covariates. Comput Stat :0-1 Heinz G, Peterson LJ, Johnson RW, Kerk CJ (00) Exploring relationships in body dimensions. J Stat Educ. publications/jse/vn/datasets.heinz.html Holm S (1) A simple sequentially rejective multiple test procedure. Scand J Stat :-0 Hothorn T, Lausen B (00) Double-Bagging: Combining classifiers by bootstrap aggregation. Pattern Recognit :0-0 Ji C, Ma S (1) Combinations of weak classifiers. IEEE Trans Neural Netw (1):- Kestler HA, Lausser L, Linder W, Palm G (0) On the fusion of threshold classifiers for categorization and dimensionality reduction. Comput Stat :1-0 Kim H, Loh WY (001) Classification trees with unbiased multiway splits. J Am Stat Assoc :-0

Kim H, Loh WY (2003) Classification trees with bivariate linear discriminant node models. J Comput Graph Stat 12:512-530
Kim H, Kim H, Moon H, Ahn H (2011) A weight-adjusted voting algorithm for ensemble of classifiers. J Korean Stat Soc 40
Kohavi R, Wolpert DH (1996) Bias plus variance decomposition for zero-one loss functions. In: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp 275-283
Kong EB, Dietterich TG (1995) Error-correcting output coding corrects bias and variance. In: Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp 313-321
Kuncheva LI, Rodríguez JJ (2007) An experimental study on rotation forest ensembles. In: Haindl M, Kittler J, Roli F (eds) Multiple Classifier Systems. Springer, Berlin, pp 459-468
Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles. Mach Learn 51:181-207
Leisch F, Dimitriadou E (2010) mlbench: Machine Learning Benchmark Problems. R package
Loh WY (2009) Improving the precision of classification trees. Ann Appl Stat 3:1710-1737
Rodríguez JJ, Kuncheva LI, Alonso CJ (2006) Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 28(10):1619-1630
Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197-227
StatLib Datasets archive. Carnegie Mellon University, Department of Statistics
Terhune JM (1994) Geographical variation of harp seal underwater vocalisations. Can J Zool 72
Zhu J, Rosset S, Zou H, Hastie T (2009) Multi-class AdaBoost. Stat Interface 2:349-360

Appendices

Appendix 1: Pseudocode of CLDA

Given:
X: the objects in the training data set (an n x p matrix)
C: number of classes
p: number of variables
S_i: covariance matrix of class i

Procedure:
1. Compute the class centroid matrix M (C x p), where the (i, j) entry is the mean of class i for variable j
2. Compute the common covariance matrix W: W = sum_{i=1}^{C} (n_i - 1) S_i
3. Compute M* = M W^{-1/2} by using the eigendecomposition of W
4. Obtain the between-class covariance matrix B by computing the covariance matrix of M*
5. Perform the eigenvalue decomposition of B such that B = V D V^T; the columns v_i of V define the coordinates of the optimal subspace
6. Convert X to the coordinates of the new subspace: Z_i = v_i^T W^{-1/2} X, where Z_i is the i-th canonical coordinate

Appendix 2: Pseudocode of Canonical Forest

Input:
X: training data composed of n instances (an n x p matrix)
Y: the labels of the training data (an n x 1 vector)
B: number of classifiers in an ensemble
K: number of subsets

w = {1,...,C}: set of class labels

Training Phase:
For i = 1,...,B:
1. Randomly split F (the feature set) into K subsets F_{i,j} (for j = 1,...,K)
2. For j = 1,...,K:
   a. Let X_{i,j} be the data matrix that corresponds to the features in F_{i,j}
   b. Draw a bootstrap sample X'_{i,j} (with sample size equal to a fixed percentage of the number of instances in X_{i,j}) from X_{i,j}
   c. Apply CLDA to X'_{i,j} to obtain a coefficient matrix A_{i,j}
3. Arrange the A_{i,j} (j = 1,...,K) into a block diagonal matrix R_i
4. Construct the rotation matrix R_i^a by rearranging the rows of R_i so that they correspond to the original order of the features in F
5. Use (X R_i^a, Y) as the training data to build a classifier L_i

Test Phase:
For a given instance x, the predicted class label from the ensemble L is:
L(x) = argmax_{y in w} sum_{i=1}^{B} I(L_i(x R_i^a) = y)
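Read literally, the CLDA steps of Appendix 1 can be sketched in a few lines of NumPy. This is our own illustrative reading, not the authors' code: W is taken as the unnormalized pooled within-class scatter exactly as written in step 2, and a singular W (e.g. more features than instances) is not handled.

```python
import numpy as np

def clda_coefficients(X, y):
    """CLDA coefficient matrix, following the steps of Appendix 1.

    Returns A such that Z = X @ A gives the canonical coordinates.
    """
    classes = np.unique(y)
    # 1. class centroid matrix M (C x p): row i is the mean vector of class i
    M = np.array([X[y == c].mean(axis=0) for c in classes])
    # 2. common covariance matrix W = sum_i (n_i - 1) S_i
    W = sum((np.sum(y == c) - 1) * np.cov(X[y == c], rowvar=False)
            for c in classes)
    # 3. W^{-1/2} from the eigendecomposition of the symmetric matrix W
    w_vals, w_vecs = np.linalg.eigh(W)
    W_inv_sqrt = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
    # 4. sphere the centroids, then take their covariance (between-class B)
    M_star = M @ W_inv_sqrt
    B = np.cov(M_star, rowvar=False)
    # 5. eigendecomposition B = V D V^T; order columns by decreasing eigenvalue
    b_vals, V = np.linalg.eigh(B)
    V = V[:, np.argsort(b_vals)[::-1]]
    # 6. Z = X @ (W^{-1/2} V) gives the canonical coordinates
    return W_inv_sqrt @ V
```

In the Canonical Forest training phase, such a coefficient matrix would play the role of A_{i,j}, computed on the bootstrapped partial feature space before being arranged into the block diagonal matrix R_i; the leading canonical coordinates carry the class separation, which is why trees built on the rotated data tend to be more accurate than trees built on the original space.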


More information

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier

More information

error rate

error rate Pruning Adaptive Boosting *** ICML-97 Final Draft *** Dragos D.Margineantu Department of Computer Science Oregon State University Corvallis, OR 9733-322 margindr@cs.orst.edu Thomas G. Dietterich Department

More information

VBM683 Machine Learning

VBM683 Machine Learning VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity What makes good ensemble? CS789: Machine Learning and Neural Network Ensemble methods Jakramate Bootkrajang Department of Computer Science Chiang Mai University 1. A member of the ensemble is accurate.

More information

A TWO-STAGE COMMITTEE MACHINE OF NEURAL NETWORKS

A TWO-STAGE COMMITTEE MACHINE OF NEURAL NETWORKS Journal of the Chinese Institute of Engineers, Vol. 32, No. 2, pp. 169-178 (2009) 169 A TWO-STAGE COMMITTEE MACHINE OF NEURAL NETWORKS Jen-Feng Wang, Chinson Yeh, Chen-Wen Yen*, and Mark L. Nagurka ABSTRACT

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Ensemble Methods for Machine Learning

Ensemble Methods for Machine Learning Ensemble Methods for Machine Learning COMBINING CLASSIFIERS: ENSEMBLE APPROACHES Common Ensemble classifiers Bagging/Random Forests Bucket of models Stacking Boosting Ensemble classifiers we ve studied

More information

Statistical Consulting Topics Classification and Regression Trees (CART)

Statistical Consulting Topics Classification and Regression Trees (CART) Statistical Consulting Topics Classification and Regression Trees (CART) Suppose the main goal in a data analysis is the prediction of a categorical variable outcome. Such as in the examples below. Given

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13

Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13 Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles. Léon Bottou COS 424 4/8/2010 Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Minimally-Sized Balanced Decomposition Schemes for Multi-Class Classification

Minimally-Sized Balanced Decomposition Schemes for Multi-Class Classification Minimally-Sized Balanced Decomposition Schemes for Multi-Class Classification Evgueni Smirnov, Matthijs Moed, Georgi Nalbantov, and Ida Sprinkhuizen-Kuyper Abstract Error-correcting output coding (ECOC)

More information