Computational Statistics Canonical Forest


Computational Statistics
Canonical Forest --Manuscript Draft--

Manuscript Number: COST-D R
Full Title: Canonical Forest
Article Type: Original Paper
Keywords: Canonical linear discriminant analysis; Classification; Ensemble; Linear discriminant analysis; Rotation Forest
Corresponding Author: Hongshik Ahn, Ph.D., SUNY Stony Brook, Stony Brook, NY, UNITED STATES
Corresponding Author's Institution: SUNY Stony Brook
First Author: Yu-Chuan Chen
Order of Authors: Yu-Chuan Chen; Hyejung Ha; Hyunjoong Kim, Ph.D.; Hongshik Ahn, Ph.D.

Abstract: We propose a new classification ensemble method named Canonical Forest. The new method uses canonical linear discriminant analysis (CLDA) and bootstrapping to obtain accurate and diverse classifiers that constitute an ensemble. We note that CLDA serves as a linear transformation tool rather than a dimension reduction tool. Since CLDA finds the transformed space that separates the classes farther in distribution, classifiers built in this space will be more accurate than those built in the original space. To further facilitate the diversity of the classifiers in an ensemble, CLDA is applied only to a partial feature space for each bootstrapped data set. To compare the performance of Canonical Forest with other widely used ensemble methods, we tested them on real and artificial data sets. Canonical Forest performed significantly better in accuracy than the other ensemble methods on most data sets. An investigation of the bias and variance decomposition shows that the success of Canonical Forest can be attributed to variance reduction.

Authors' Response to Reviewers' Comments (CS_response_00.docx)

Response to the reviewers

Title: Canonical Forest

We appreciate your review of our manuscript. We have revised the manuscript accordingly. Our responses to the reviewers' comments are given below. The page numbers and line numbers refer to the revised manuscript.

Response to the comments of the Co-editor

1. At the end of your introduction, please be specific and indicate in which of the following sections the listed content will be mentioned. Also mention here that the pseudo-code can be found in the two appendices.
Response: We have added a brief introduction of each section in the last paragraph of Section 1.

2. In addition to the figure mentioned by the reviewer (these are far too small!), I think that Fig. 1 also could deserve slightly larger labels for both axes.
Response: We have enlarged the labels for both axes in Fig. 1.

3. When you follow the comment from the reviewer, it may be a good idea to consider the interval (0, 1) for kappa and perhaps (0, 0.) for the error rates.
Response: We have fixed the interval (0, 1) for kappa and (0, 0.) for the error rates throughout the plots.

4. When you redo this plot, correct "Lengend" to "Legend" and enlarge the font size a little bit. Moreover, you can reduce the spacing between the figures via the par() command in R.
Response: We have corrected this typo and also enlarged the font size.

Response to the comments of the Reviewer

1. Can you increase the size of the labels and axis labels in the figure? The font size is too small.
Response: We have enlarged both the labels and the axes in the figure.

2. In the figure, how about changing the range on the x-axis and y-axis to be the same throughout the plots so that we can compare across the data sets?
Response: We have fixed the interval (0, 1) for kappa and (0, 0.) for the error rates throughout the plots.

Computational Statistics manuscript No. (will be inserted by the editor)

Canonical Forest

Yu-Chuan Chen · Hyejung Ha · Hyunjoong Kim · Hongshik Ahn

Abstract We propose a new classification ensemble method named Canonical Forest. The new method uses canonical linear discriminant analysis (CLDA) and bootstrapping to obtain accurate and diverse classifiers that constitute an ensemble. We note that CLDA serves as a linear transformation tool rather than a dimension reduction tool. Since CLDA finds the transformed space that separates the classes farther in distribution, classifiers built in this space will be more accurate than those built in the original space. To further facilitate the diversity of the classifiers in an ensemble, CLDA is applied only to a partial feature space for each bootstrapped data set. To compare the performance of Canonical Forest with other widely used ensemble methods, we tested them on real and artificial data sets. Canonical Forest performed significantly better in accuracy than the other ensemble methods on most data sets. An investigation of the bias and variance decomposition shows that the success of Canonical Forest can be attributed to variance reduction.

Keywords Canonical linear discriminant analysis · Classification · Ensemble · Linear discriminant analysis · Rotation Forest

1 Introduction

Classification is a procedure to build a model using the class labels and their features (predictor variables) in a training data set and then to predict the class labels of future instances. Popular classification techniques include decision trees, logistic regression, neural networks, support vector machines, and linear discriminant analysis. The concept of combining many classifiers, known as a classification ensemble, was proposed to improve the overall classification accuracy.
A classification ensemble summarizes the predictions from multiple classifiers through majority voting or weighted voting to form the final prediction. It is known that a classification ensemble consisting of several weak classifiers can improve the classification accuracy (Ji and Ma 1997; Hastie et al. 2001). Here, a weak classifier is defined as one that performs only slightly better than random guessing. The accuracy of a classification ensemble can be improved further by diversifying the classifiers that constitute the ensemble. Two ensemble methods, Boosting (Schapire 1990; Freund and Schapire 1996, 1997) and Bagging (Breiman 1996), have been widely used for this purpose. More recently, Rotation Forest (Rodríguez et al. 2006) was proposed to further diversify the classifiers in an ensemble by using principal component analysis. All these methods use multiple re-sampled or re-weighted training data sets to build diverse classifiers, as described below.

Yu-Chuan Chen · Hongshik Ahn ( ) Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY; SUNY Korea, Incheon, South Korea. hongshik.ahn@stonybrook.edu
Hyejung Ha · Hyunjoong Kim, Department of Applied Statistics, Yonsei University, Seoul, South Korea

Boosting changes the distribution of the training data set during the construction of the ensemble, then combines the classifiers using weighted voting; higher weights are assigned to classifiers with better classification accuracy. Adaboost (Freund and Schapire 1996, 1997) is the most widely used boosting method. Samme (Zhu et al. 2009) is a newer algorithm that naturally extends Adaboost to the multiclass case without reducing it to multiple two-class problems. Recently, a boosting algorithm specifically designed for asymmetrically mislabeled data, named Asymmetric η-boost, was proposed by Hayashi (2012). Asymmetric η-boost generalizes Adaboost by adopting the asymmetric η-loss function, making the boosting algorithm more resistant to asymmetric mislabeling. Bagging uses bootstrapped training data at each stage of the ensemble to build diverse classifiers; simple majority voting combines the fitted classifiers to form the final classification. Random Forest (Breiman 2001), a variant of Bagging, is a decision tree based ensemble method. It randomly uses a feature subset, instead of all the features, to find a split when growing a decision tree. Denote by p the number of all features. The square root of p is the default feature subset size at a node in the R package randomForest, and Ahn et al. (2007) observed that this default gives consistently good results on many data sets. Double-Bagging (Hothorn and Lausen 2003) is a CLDA based ensemble method that combines the algorithms of CLDA and Bagging. For each base classifier, Double-Bagging first takes a bootstrap sample from the training data and then applies CLDA to the out-of-bag (OOB) sample to obtain a coefficient matrix. The canonical features are computed from this coefficient matrix in the bootstrap sample, and the base classifier is built using both the original features and the canonical features. Finally, simple majority voting combines the base classifiers. Like Bagging and Random Forest, Rotation Forest (Rodríguez et al. 2006) fits all the classifiers in parallel.
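The bootstrap-plus-majority-vote core that Bagging contributes to all of these parallel ensembles can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the one-threshold "stump" base learner and all function names are hypothetical.

```python
import random
from collections import Counter

def bagging_predict(train, test_point, fit, B=25, seed=0):
    """Fit B classifiers on bootstrap samples of the training data and
    combine their predictions by simple majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(B):
        # bootstrap: draw n points with replacement from the training data
        boot = [rng.choice(train) for _ in train]
        votes.append(fit(boot)(test_point))
    return Counter(votes).most_common(1)[0][0]

def fit_stump(data):
    """A weak base learner: a single threshold on one feature, predicting
    the majority class on each side of the threshold."""
    t = sum(x for x, _ in data) / len(data)
    left = Counter(y for x, y in data if x <= t)
    right = Counter(y for x, y in data if x > t)
    l = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    r = right.most_common(1)[0][0] if right else l
    return lambda x: l if x <= t else r
```

Each individual stump is only a weak classifier, but the majority vote over B bootstrap replicates is far more stable, which is the point made above.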
To fit a classifier in the ensemble, it randomly splits the training data of p features into K subsets with roughly m = p/K features each. Principal component analysis (PCA) is applied to each of the K subsets using the bootstrapped data within the subset. A rotation matrix containing the coefficients obtained from PCA is then multiplied with the original training data to obtain a rotated training data set, and a classifier is fitted to the rotated data. This process is repeated to constitute an ensemble, and majority voting combines the fitted classifiers. Kestler et al. (2011) proposed an ensemble method that combines many simple threshold classifiers named rays (Anthony and Biggs 1992) using two different voting schemes: majority voting and conjunction (unanimous voting). They also use two different approaches to generate the rays: univariate feature selection evaluated by the area under the ROC curve (AUC), and greedy feature selection. In this study, we develop a new ensemble method called Canonical Forest, which combines its classifiers by majority vote. We explain the algorithm of Canonical Forest in Section 2 and provide the pseudocode in the Appendices. In Section 3, we compare Canonical Forest with current popular ensemble methods including Bagging, Random Forest, Adaboost, Samme, and Rotation Forest. In Section 4, we investigate the bias and variance decomposition for each method. Section 5 provides a discussion.

2 Method

2.1 Feature Extraction and CLDA

For high-dimensional data with redundant information, the data can be transformed into a reduced set of features; this is called feature extraction. If the feature extraction is successful, the set of new features contains most of the information in the data set. PCA is the most widely used method of feature extraction.
However, when the data consist of several classes, canonical linear discriminant analysis (CLDA) is a better approach because PCA does not utilize the class information. The general idea of CLDA is to find a linear combination of features that maximizes the between-class variance relative to the within-class variance (Hastie et al. 2001). That is, it finds a linear transformation of the features that separates the class distributions as far as possible. In this sense, classification using the features extracted by CLDA performs better than classification using the original features. The algorithm of CLDA, adapted from Hastie et al. (2001), is given in Appendix 1.

2.2 Canonical Forest

The main difference between Canonical Forest and Rotation Forest (Rodríguez et al. 2006) lies in the method of feature extraction: CLDA is used for Canonical Forest, while PCA is used for Rotation Forest.

Let x = [x_1, ..., x_p]^T be an instance represented by p features and let X be the training data composed of n instances, in the form of an n × p matrix. Let Y = [y_1, ..., y_n]^T be the class labels (1, ..., C) of the training data. The classifiers in the ensemble are denoted by L_1, ..., L_B and the feature set is denoted by F. To set up the training data for classifier L_i, we use the following steps.

1. Randomly split F into K subsets. The subsets are made disjoint to enhance the diversity of the ensemble. For simplicity, assume that p is divisible by K; then each feature subset contains m = p/K features. If p is not divisible by K, we let m = floor(p/K) + 1 and the last subset contains the remaining features.
2. Denote by F_{i,j} the jth subset of features of the training data for classifier L_i. For each such subset, use bootstrap resampling with a fixed proportion of the original sample size. Run CLDA on F_{i,j} and obtain a coefficient matrix A_{i,j} = [a_{i,j}^(1), a_{i,j}^(2), ..., a_{i,j}^(m)] of size m × m. Note that each a_{i,j}^(k) has length m because all the canonical components are kept, without dimension reduction.
3. Arrange the obtained coefficient matrices A_{i,1}, ..., A_{i,K} into a p × p block diagonal matrix R_i. Construct the rotation matrix R_i^a (a p × p matrix) by rearranging the rows of R_i so that they correspond to the original features in F.
4. The new training data for classifier L_i is (X R_i^a, Y).

The pseudocode for Canonical Forest is given in Appendix 2. Since the centroids of the C classes in the p-dimensional input space span at most a (C − 1)-dimensional subspace, it is common to extract C − 1 features in CLDA for dimension reduction. However, we use CLDA as a linear transformation tool rather than a dimension reduction tool in this study. Therefore, we set the number of extracted features equal to the number of original features.
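Steps 1 and 3 above, the random split into disjoint feature subsets and the assembly of the row-rearranged block-diagonal rotation matrix, can be sketched as follows. This is an illustrative reconstruction of those definitions, not the authors' code; plain nested lists stand in for matrices, and the coefficient blocks are taken as given.

```python
import random

def split_features(p, K, seed=0):
    """Step 1: randomly partition feature indices 0..p-1 into disjoint
    subsets.  If p is divisible by K each subset has m = p/K features;
    otherwise m = floor(p/K) + 1 and the last subset takes the remainder."""
    idx = list(range(p))
    random.Random(seed).shuffle(idx)
    m = p // K if p % K == 0 else p // K + 1
    return [idx[j:j + m] for j in range(0, p, m)]

def rotation_matrix(subsets, coef_blocks, p):
    """Step 3: place each subset's m x m coefficient block on the diagonal,
    writing its rows directly at the original feature indices so the result
    is already the rearranged rotation matrix R_i^a."""
    R = [[0.0] * p for _ in range(p)]
    col0 = 0
    for subset, A in zip(subsets, coef_blocks):
        for a, feat in enumerate(subset):
            for b in range(len(subset)):
                R[feat][col0 + b] = A[a][b]   # row = original feature index
        col0 += len(subset)
    return R
```

With identity coefficient blocks the result is a pure permutation of the features, which makes the row-rearrangement easy to check by hand.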
This results in m extracted features in each subset of features, even if m is greater than C − 1. Although these extra features may not contribute much to the discriminatory power, they encourage the classifiers to be more diverse, which can yield higher ensemble accuracy.

3 Results

An experiment was conducted using real and artificial data sets to compare Canonical Forest with Bagging, Adaboost, Samme, Random Forest, and Rotation Forest. Decision trees were used as the base classifiers for all the ensemble methods. We used unpruned decision trees as base classifiers, except for Adaboost and Samme, for which we set the maximum depth of each tree to the number of classes, the default setting of adaboost.m1 in R. The decision tree program rpart, available in the R library, was used for the experiment; rpart is based on the CART (Classification and Regression Trees) algorithm (Breiman et al. 1984). In Random Forest, we set the number of features chosen at each node to the square root of the total number of features, the default setting in the R package randomForest. In Canonical Forest and Rotation Forest, the number of features in each subset was set to m = 3; if p is not divisible by 3, the last subset was completed with the remaining features (1 or 2 features). The data are summarized in Table 1. Most of the data sets come from the UCI Machine Learning Repository (Asuncion and Newman 2007) and the R package mlbench (Leisch and Dimitriadou). Since PCA and CLDA cannot be applied to discrete features, all discrete features were removed from each data set. To compare the accuracy of the ensemble methods over a range of ensemble sizes, twenty repetitions of cross-validation were performed for each data set. A paired t-test was used for each data set to measure the statistical significance of the difference between Canonical Forest and each of the other methods.
Figure 1 shows boxplots comparing Canonical Forest with the other classification methods using the statistics from the paired t-tests. A positive paired t-test statistic indicates that Canonical Forest performs better. Figure 1 shows that the t-test statistic tends to increase with ensemble size for a single tree and for Bagging, tends to decrease with ensemble size for Random Forest, and shows no obvious trend for Adaboost, Samme, and Rotation Forest. In general, Canonical Forest performs better than the other ensemble methods at each ensemble size.
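The paired t statistic behind these comparisons is straightforward to compute from per-fold (or per-repetition) accuracies. A minimal sketch with hypothetical accuracy vectors; following the sign convention of Figure 1, a positive value favors the first method.

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic for matched accuracy measurements of two methods.
    t = mean(d) / sqrt(var(d) / n), where d are the paired differences."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

The statistic is compared against a t distribution with n − 1 degrees of freedom to obtain a p-value.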

Table 1 Description of the data sets, by source (the numbers of observations, continuous features, and classes are listed per set in the original table):
UCI: aba, aus, bld, bos, ech, hea, ion, iri, led, pid, spe, veh, vow
R library: cir, mam, pks, rng, snr, trn, twn
Heinz et al. (2003): bod
Loh (2009): dia, lak, vol
Kim and Loh (2003): fis
Kim et al. (2011): int
Kim and Loh (2001): pov
Terhune (1994): sea
Statlib: usn

To compare the accuracies of the methods at large ensemble sizes, we fixed the ensemble size at two values, the larger being B = 500. The average accuracies are shown in Tables 2 and 3. The comparisons were made using a paired t-test at a two-sided significance level of α = 0.05. We also compared the performance of LDA with that of Canonical Forest and found Canonical Forest to be significantly better than LDA on most of the data sets. Since our focus in this paper is to show that Canonical Forest is competitive with other current popular ensemble methods, we did not include LDA in Tables 2 and 3, to keep the tables concise. The single tree is included in Tables 2 and 3 as a reference, because trees are the base classifiers in all the ensemble methods. Tables 4 and 5 summarize the comparisons given in Tables 2 and 3, respectively. The entry a_{ij} shows the frequency with which the method in column j is more accurate than the method in row i, and the number in parentheses shows how often these differences are statistically significant. In Table 4, for example, the entry in the Bagging row and the Canonical Forest column gives the number of data sets on which Canonical Forest was more accurate than Bagging, with the parenthesized number counting the data sets on which it was significantly better.
Conversely, the entry in the Canonical Forest row and the Bagging column counts the remaining cases, in which Bagging was significantly better than Canonical Forest. Tables 6 and 7 present rankings of the methods based on the frequency with which each method was significantly more accurate and significantly less accurate than the others. In Table 6, for example, the number of wins for Canonical Forest is the sum of the parenthesized numbers in the Canonical Forest column of Table 4; the number of losses is the sum of the parenthesized numbers in the Canonical Forest row; and the dominance rank is wins minus losses. Tables 2 through 7 show that Canonical Forest outperformed the other widely used ensemble methods: its dominance rank was substantially larger than that of the second-best method at both ensemble sizes. To confirm that the superiority of Canonical Forest is not just by chance, we performed an exact binomial test on the classification accuracies of Tables 2 and 3. A one-sided test was adopted to obtain the p-values for the

Table 2 Classification accuracy of the single tree and the ensemble methods (Bagging, Adaboost, Samme, Random Forest, Rotation Forest) compared with Canonical Forest on the 29 data sets at the smaller ensemble size, with win/tie/loss counts for Canonical Forest against each method. + Canonical Forest significantly better; - Canonical Forest significantly worse; significance level 0.05.

alternative hypothesis that Canonical Forest performs better than another method. Table 8 shows the p-values of the exact binomial tests. We used the Holm-Bonferroni correction (Holm 1979) to adjust the significance level of the multiple tests and maintain an overall type I error rate of α = 0.05. With six comparisons, the adjusted significance levels, from smallest to largest, are α/6 ≈ 0.0083, α/5 = 0.0100, α/4 = 0.0125, α/3 ≈ 0.0167, α/2 = 0.0250, and α = 0.0500. The p-values, arranged in increasing order, are compared with these adjusted values, proceeding sequentially from the smallest p-value. As a result, at the smaller ensemble size, Canonical Forest is significantly more accurate than all the other methods. At B = 500, Canonical Forest is also significantly more accurate than the other methods except Samme; the p-value for the comparison with Samme, however, is quite close to the adjusted significance level. We also conducted an experiment investigating the sample size as a factor in the gain of Canonical Forest over the other methods, using eight randomly selected data sets: aus, bos, dia, pid, snr, trn, vol, and vow. For each data set, bootstrap sampling was used to generate each sample size, with ten repetitions per sample size averaged. No obvious trends were found when comparing Canonical Forest with Tree, Bagging, Random Forest, and Rotation Forest.
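Holm's step-down procedure described above can be sketched directly; the p-values below are hypothetical, not those of Table 8.

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value with
    alpha/(m-k) for k = 0, 1, ..., and stop at the first non-rejection.
    Returns booleans in the original order (True = null rejected)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break   # all remaining (larger) p-values fail as well
    return reject
```

With six comparisons the first threshold is α/6 ≈ 0.0083, rising to α for the largest p-value, exactly the sequence of adjusted levels used in the text.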
However, when comparing Canonical Forest with Adaboost and Samme, the gain of Canonical Forest over each of these two methods tends to decrease as the sample size increases for most of these data sets. In addition to accuracy, we investigated the diversity of each ensemble method. The kappa statistic κ (Cohen 1960) measures the agreement between two categorical variables; the agreement between two classifiers L_i and L_j can thus be measured by κ (Kuncheva and Whitaker 2003; Rodríguez et al. 2006). An ensemble of B classifiers yields B(B − 1)/2 pairs of classifiers (L_i, L_j). Higher diversity among the classifiers in an ensemble produces smaller κ values. Figure 2 shows κ-error diagrams for selected data sets; the diagrams for the other data sets are similar to one of these plots. In each κ-error diagram, the x-axis is the kappa statistic κ and the y-axis is the averaged error of L_i and L_j, denoted E_{i,j} = (E_i + E_j)/2, where E_i and E_j are the error rates of L_i and L_j, respectively. Since small values of κ indicate that the classifiers are more diverse and small values of E_{i,j}

Table 3 Classification accuracy of the single tree and the ensemble methods compared with Canonical Forest on the 29 data sets at ensemble size B = 500, with win/tie/loss counts for Canonical Forest against each method. + Canonical Forest significantly better; - Canonical Forest significantly worse. Some features of data set hea were removed due to computational problems.

Table 4 Summary of comparisons among the methods (Tree, Bagging, Adaboost, Samme, Random Forest, Rotation Forest, Canonical Forest) at the smaller ensemble size. The entry a_{ij} shows the frequency that the method in column j is more accurate than the method in row i; the number in parentheses shows the frequency that these differences are statistically significant.

indicate a high accuracy, the ideal location of the dots is the lower left corner. Five hundred classifiers were fitted using the training data, and the results on the test data were used to calculate the κ statistics and the error rates. The κ-error diagram shows the relative position of the ensemble methods for each data set. To help read Figure 2, take data set aus as an example: Samme has the largest diversity but also the highest error rate, while Canonical Forest has the lowest error rate (its points lie lower on the y-axis) but is slightly less diverse (its points lie farther right on the x-axis) than the other four ensemble methods. Figure 3 shows contour plots of the κ-error diagrams for six representative data sets; we used contour plots to make the differences clearer.
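The quantities plotted in each κ-error diagram, Cohen's κ and the averaged error per pair of classifiers, can be computed as follows. A minimal sketch from the definitions above; the function names are ours, not from the paper.

```python
from itertools import combinations

def kappa(labels_a, labels_b, classes):
    """Cohen's kappa between the predictions of two classifiers:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    chance = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    return (observed - chance) / (1 - chance)

def kappa_error_points(predictions, truth, classes):
    """One (kappa, averaged error E_ij) point per pair of classifiers;
    an ensemble of B classifiers yields B*(B-1)/2 such points."""
    errors = [sum(p != t for p, t in zip(pred, truth)) / len(truth)
              for pred in predictions]
    return [(kappa(predictions[i], predictions[j], classes),
             (errors[i] + errors[j]) / 2)
            for i, j in combinations(range(len(predictions)), 2)]
```

Identical classifiers give κ = 1 and perfectly disagreeing ones give κ = −1, so a diverse, accurate ensemble populates the lower left of the diagram.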
Taking data set fis as an example, Canonical Forest lies toward the lower right corner because of its lower error rate and lower diversity relative to the other ensemble methods, while Samme lies toward the upper left corner because of its higher error rate and higher diversity. Note that Samme is not shown in the plots for bod, bos, and sea because of its very high error rate and diversity. Canonical Forest appears to outperform the other ensemble methods in terms of accuracy, while its diversity is similar to that of Bagging and Rotation Forest and it is less diverse

Table 5 Summary of comparisons among the methods at ensemble size B = 500. The entry a_{ij} shows the frequency that the method in column j is more accurate than the method in row i; the number in parentheses shows the frequency that these differences are statistically significant.

Table 6 Dominance ranks (wins, losses, and wins − losses) of the methods, using the significant differences from Table 4. In decreasing order of dominance rank: Canonical Forest, Rotation Forest, Random Forest, Samme, Bagging, Adaboost, Tree.

Table 7 Dominance ranks of the methods, using the significant differences from Table 5. In decreasing order of dominance rank: Canonical Forest, Random Forest, Rotation Forest, Adaboost, Samme, Bagging, Tree.

Table 8 P-values of the exact binomial tests comparing Canonical Forest with the other methods at both ensemble sizes, with the Holm-Bonferroni adjusted significance levels. The results are significant under the Holm-Bonferroni correction at α = 0.05 except the comparison with Samme at ensemble size 500.

than Samme and Random Forest. This suggests that if we could increase the diversity of Canonical Forest, it could be an even more powerful ensemble method.
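The dominance-rank bookkeeping described above (wins = column sums of the significant counts, losses = row sums, rank = wins minus losses) can be sketched directly. The significance matrix below is a hypothetical 3-method example, not data from the paper.

```python
def dominance_ranks(methods, sig):
    """sig[i][j] = number of data sets on which method j is significantly
    more accurate than method i (the parenthesized counts in the summary
    tables).  Returns each method's dominance rank = wins - losses."""
    n = len(methods)
    ranks = {}
    for j, m in enumerate(methods):
        wins = sum(sig[i][j] for i in range(n))     # column sum
        losses = sum(sig[j][k] for k in range(n))   # row sum
        ranks[m] = wins - losses
    return ranks
```

Sorting the methods by this rank reproduces orderings of the kind reported in the dominance tables.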

Fig. 1 Boxplots of paired t-test statistics between Canonical Forest and the other methods. The x-axis is the ensemble size B, and the y-axis is the paired t-test statistic. CanF stands for Canonical Forest, RnF for Random Forest, and RotF for Rotation Forest.

Fig. 2 κ-error diagrams for selected data sets (aus, bos, fis, hea, pks, rng); x-axis = κ, y-axis = E_{i,j} (average error of the pair of classifiers). Methods plotted: Bagging, Samme, Random Forest, Rotation Forest, Canonical Forest.

Fig. 3 Contour plots of κ-error diagrams for selected data sets.

4 Bias and Variance Decomposition

The bias and variance decomposition of the error (Geman et al. 1992) is a popular and useful analysis tool. The bias measures the distance between the prediction of the classifier and the target function, and the variance measures the variation among the predictions of different classifiers. Several authors, including Kong and Dietterich (1995), Kohavi and Wolpert (1996), and Breiman (1998), have proposed methods for the bias and variance decomposition; in this study we used the decomposition proposed by Kohavi and Wolpert (1996). The same data sets as in Section 3 (except data set aus, due to a computational problem) were used. We used cross-validation to estimate the bias and variance, with an ensemble of size 500 for each ensemble method. Table 9 shows the comparison of bias. To read the Win - Loss row in Table 9, take Samme as an example: its entry against Canonical Forest counts the data sets on which the bias of Samme is smaller than that of Canonical Forest versus those on which it is larger. Samme appears to be the best among the ensemble methods considered here at reducing bias, while Canonical Forest is in the middle. Table 10 compares the variance: Canonical Forest dominates all the other methods in reducing variance. In Tables 9 and 10, the p-values were obtained from one-sided Wilcoxon signed-rank tests. We conclude that the high accuracy of Canonical Forest shown in the previous section is mainly due to variance reduction.

Table 9 Comparison of the contribution of bias to the error for the single tree and the ensemble methods on each data set, with Win - Loss counts and p-values against Canonical Forest.
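The Kohavi-Wolpert decomposition can be sketched at the level of a single test point as follows, assuming a deterministic true label (so the irreducible noise term is zero, an assumption of this sketch); the per-point bias² and variance terms are then averaged over the test set.

```python
from collections import Counter

def kohavi_wolpert(predictions, truth):
    """Kohavi-Wolpert (1996) decomposition of 0-1 loss at one test point.
    predictions: labels predicted for this point by classifiers trained on
    different resampled training sets; truth: the true label.
    Returns (bias_squared, variance)."""
    n = len(predictions)
    counts = Counter(predictions)
    labels = set(predictions) | {truth}
    probs = {y: counts[y] / n for y in labels}
    # bias^2 = 0.5 * sum_y (P(Y = y) - P(Yhat = y))^2, with P(Y=y) = 1{y=truth}
    bias_sq = 0.5 * sum(((y == truth) - probs[y]) ** 2 for y in labels)
    # variance = 0.5 * (1 - sum_y P(Yhat = y)^2)
    variance = 0.5 * (1 - sum(q * q for q in probs.values()))
    return bias_sq, variance
```

Classifiers that always agree on the correct label contribute zero bias and zero variance; an even split between the correct and a wrong label contributes equally to both terms.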

Table 10 Comparison of the contribution of variance to the error for the single tree and the ensemble methods on each data set, with Win - Loss counts and p-values against Canonical Forest.

5 Discussion

We introduced a new ensemble method called Canonical Forest. It uses CLDA to perform a linear transformation of the original input data so that the transformed data are more distinct among classes. To enhance diversity, the features are split into subsets, and CLDA is applied to each subset separately. Disjoint subsets yield enhanced diversity because there is no overlap between subsets; the number of subsets, however, is not necessarily related to the diversity of the classifiers in Canonical Forest. A classifier is built using all the canonical components obtained from CLDA. Here CLDA serves as a linear transformation tool rather than a dimension reduction tool, since we keep all the features when applying CLDA. We chose CLDA because it is a supervised learning method: when linearly transforming the data, CLDA utilizes the class information and makes the classes more separable after the transformation. This makes CLDA a better linear transformation tool than other dimension reduction methods for classification analysis. Canonical Forest is formed by combining these classifiers with a majority vote. Although both Double-Bagging and Canonical Forest are CLDA based ensemble methods, they differ in that CLDA is applied to the OOB sample in Double-Bagging but to the bootstrap sample in Canonical Forest. Moreover, CLDA is applied to the whole feature space in Double-Bagging, but to each mutually exclusive feature subset in Canonical Forest.
In our experiments, Canonical Forest performed better in terms of accuracy than the other widely used ensemble methods, especially when the ensemble size was small, and the exact binomial test confirmed its superiority over the other methods. Through the bias and variance decomposition, we found that the reduction of variance played the major role in improving the accuracy of Canonical Forest. The gap in performance between Canonical Forest and the other methods decreased slightly as the ensemble size increased; the κ-error diagrams suggest this is because Canonical Forest is slightly less diverse than the other ensemble methods. Nevertheless, Canonical Forest still showed better classification accuracy when compared with the other methods at an ensemble size of 500, which is nearly optimal for the other methods.

In this experiment, the number of features in each subset was set to m = 3 for both Rotation Forest and Canonical Forest, because this was the parameter setting used in the Rotation Forest experiments (Rodríguez et al. 2006); we used the same setting to make the comparison consistent. We will further investigate the effect of the choice of m on the performance of Canonical Forest in a future study, as Kuncheva and Rodríguez (2007) did for Rotation Forest. It should also be noted that although we removed the discrete features from all the data sets in this experiment, because CLDA and PCA cannot be applied to discrete features, it is acceptable to apply CLDA or PCA to ordinal discrete features by treating them as continuous. Since Canonical Forest is a CLDA based ensemble method, its performance may depend on the performance of CLDA: Canonical Forest will perform better in situations where CLDA performs better. In general, CLDA works better when the discriminatory information lies in the means rather than the variances of the data. CLDA also works better on balanced data (i.e., when the number of instances per class is roughly equal) than on unbalanced data, because it needs representative instances of each class to separate the classes well.

Acknowledgement Hyunjoong Kim's work was partly supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (01R1A1A01). Hongshik Ahn's work was partially supported by the IT Consilience Creative Project through the Ministry of Knowledge Economy, Republic of Korea.

References

Ahn H, Moon H, Fazzari MJ, Lim N, Chen JJ, Kodell RL (2007) Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal
Anthony M, Biggs N (1992) Computational Learning Theory. Cambridge University Press, Cambridge
Asuncion A, Newman DJ (2007) UCI Machine Learning Repository.
University of California, Irvine, School of Information and Computer Science. mlearn/mlrepository.html Breiman L (1) Bagging predictors. Mach Learn :1-0 Breiman L (1) Arcing classifiers. Ann Stat :01- Breiman L (001) Random forest. Mach Learn :- Breiman L, Friedman JH, Olshen RA, Stone CJ (1) Classification and Regression Trees. Wadsworth, Belmont, CA Cohen J (10) A coefficient of agreement for nominal scales. Educ Psychol Meas 0(1): - Freund Y, Schapire R (1) Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp - Freund Y, Schapire R (1) A decision-theoretic generalization of online learning and an application to boosting. J Comput Syst Sci :- Geman S, Bienenstock E, Doursat R (1) Neural networks and the bias/variance dilemma. Neural Comput :1- Hastie T, Tibshirani R, Friedman J (001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, New York Hayashi K (01) A boosting method with asymmetric mislabeling probabilities which depend on covariates. Comput Stat :0-1 Heinz G, Peterson LJ, Johnson RW, Kerk CJ (00) Exploring relationships in body dimensions. J Stat Educ. publications/jse/vn/datasets.heinz.html Holm S (1) A simple sequentially rejective multiple test procedure. Scand J Stat :-0 Hothorn T, Lausen B (00) Double-Bagging: Combining classifiers by bootstrap aggregation. Pattern Recognit :0-0 Ji C, Ma S (1) Combinations of weak classifiers. IEEE Trans Neural Netw (1):- Kestler HA, Lausser L, Linder W, Palm G (0) On the fusion of threshold classifiers for categorization and dimensionality reduction. Comput Stat :1-0 Kim H, Loh WY (001) Classification trees with unbiased multiway splits. J Am Stat Assoc :-0

Kim H, Loh WY (2003) Classification trees with bivariate linear discriminant node models. J Comput Graph Stat 12:512-530
Kim H, Kim H, Moon H, Ahn H (2011) A weight-adjusted voting algorithm for ensemble of classifiers. J Korean Stat Soc 40
Kohavi R, Wolpert DH (1996) Bias plus variance decomposition for zero-one loss functions. In: Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp 275-283
Kong EB, Dietterich TG (1995) Error-correcting output coding corrects bias and variance. In: Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp 313-321
Kuncheva LI, Rodríguez JJ (2007) An experimental study on rotation forest ensembles. In: Haindl M, Kittler J, Roli F (eds) Multiple Classifier Systems. Springer, Berlin, pp 459-468
Kuncheva LI, Whitaker CJ (2003) Measures of diversity in classifier ensembles. Mach Learn 51:181-207
Leisch F, Dimitriadou E (2010) mlbench: Machine Learning Benchmark Problems. R package
Loh WY (2009) Improving the precision of classification trees. Ann Appl Stat 3:1710-1737
Rodríguez JJ, Kuncheva LI, Alonso CJ (2006) Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 28(10):1619-1630
Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197-227
StatLib Datasets archive. Carnegie Mellon University, Department of Statistics
Terhune JM (1994) Geographical variation of harp seal underwater vocalisations. Can J Zool 72
Zhu J, Rosset S, Zou H, Hastie T (2009) Multi-class AdaBoost. Stat Interface 2:349-360

Appendices

Appendix 1: Pseudocode of CLDA

Given:
X: the objects in the training data set (an n x p matrix)
C: number of classes
p: number of variables
S_i: covariance matrix of class i

Procedure:
1. Compute the class centroid matrix M (C x p), where the (i, j) entry is the mean of class i for variable j
2. Compute the common covariance matrix W: W = sum_{i=1}^{C} (n_i - 1) S_i
3. Compute M* = M W^{-1/2} by using the eigendecomposition of W
4. Obtain the between-class covariance matrix B by computing the covariance matrix of M*
5. Perform the eigenvalue decomposition of B such that B = V D V^T; the columns v_i of V define the coordinates of the optimal subspace
6. Convert X to the coordinates of the new subspace: Z_i = v_i^T W^{-1/2} X, where Z_i is the i-th canonical coordinate

Appendix 2: Pseudocode of Canonical Forest

Input:
X: training data composed of n instances (an n x p matrix)
Y: the labels of the training data (an n x 1 vector)
B: number of classifiers in an ensemble
K: number of subsets

w = {1,...,C}: set of class labels

Training Phase:
For i = 1,...,B:
1. Randomly split F (the feature set) into K subsets F_{i,j} (for j = 1,...,K)
2. For j = 1,...,K:
   a. Let X_{i,j} be the data matrix that corresponds to the features in F_{i,j}
   b. Draw a bootstrap sample X'_{i,j} (with sample size equal to a fixed percentage of the number of instances in X_{i,j}) from X_{i,j}
   c. Apply CLDA to X'_{i,j} to obtain a coefficient matrix A_{i,j}
3. Arrange the A_{i,j} (j = 1,...,K) into a block diagonal matrix R_i
4. Construct the rotation matrix R_i^a by rearranging the rows of R_i so that they correspond to the original order of the features in F
5. Use (X R_i^a, Y) as the training data to build a classifier L_i

Test Phase:
For a given instance x, the predicted class label from the ensemble L is:
L(x) = argmax_{y in w} sum_{i=1}^{B} I(L_i(x R_i^a) = y)
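Read literally, the CLDA steps of Appendix 1 can be sketched in a few lines of NumPy. This is our own illustrative reading, not the authors' code: W is taken as the unnormalized pooled within-class scatter exactly as written in step 2, and a singular W (e.g. more features than instances) is not handled.

```python
import numpy as np

def clda_coefficients(X, y):
    """CLDA coefficient matrix, following the steps of Appendix 1.

    Returns A such that Z = X @ A gives the canonical coordinates.
    """
    classes = np.unique(y)
    # 1. class centroid matrix M (C x p): row i is the mean vector of class i
    M = np.array([X[y == c].mean(axis=0) for c in classes])
    # 2. common covariance matrix W = sum_i (n_i - 1) S_i
    W = sum((np.sum(y == c) - 1) * np.cov(X[y == c], rowvar=False)
            for c in classes)
    # 3. W^{-1/2} from the eigendecomposition of the symmetric matrix W
    w_vals, w_vecs = np.linalg.eigh(W)
    W_inv_sqrt = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
    # 4. sphere the centroids, then take their covariance (between-class B)
    M_star = M @ W_inv_sqrt
    B = np.cov(M_star, rowvar=False)
    # 5. eigendecomposition B = V D V^T; order columns by decreasing eigenvalue
    b_vals, V = np.linalg.eigh(B)
    V = V[:, np.argsort(b_vals)[::-1]]
    # 6. Z = X @ (W^{-1/2} V) gives the canonical coordinates
    return W_inv_sqrt @ V
```

In the Canonical Forest training phase, such a coefficient matrix would play the role of A_{i,j}, computed on the bootstrapped partial feature space before being arranged into the block diagonal matrix R_i; the leading canonical coordinates carry the class separation, which is why trees built on the rotated data tend to be more accurate than trees built on the original space.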


More information

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi

Boosting. CAP5610: Machine Learning Instructor: Guo-Jun Qi Boosting CAP5610: Machine Learning Instructor: Guo-Jun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier

More information

error rate

error rate Pruning Adaptive Boosting *** ICML-97 Final Draft *** Dragos D.Margineantu Department of Computer Science Oregon State University Corvallis, OR 9733-322 margindr@cs.orst.edu Thomas G. Dietterich Department

More information

VBM683 Machine Learning

VBM683 Machine Learning VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity

What makes good ensemble? CS789: Machine Learning and Neural Network. Introduction. More on diversity What makes good ensemble? CS789: Machine Learning and Neural Network Ensemble methods Jakramate Bootkrajang Department of Computer Science Chiang Mai University 1. A member of the ensemble is accurate.

More information

A TWO-STAGE COMMITTEE MACHINE OF NEURAL NETWORKS

A TWO-STAGE COMMITTEE MACHINE OF NEURAL NETWORKS Journal of the Chinese Institute of Engineers, Vol. 32, No. 2, pp. 169-178 (2009) 169 A TWO-STAGE COMMITTEE MACHINE OF NEURAL NETWORKS Jen-Feng Wang, Chinson Yeh, Chen-Wen Yen*, and Mark L. Nagurka ABSTRACT

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Ensemble Methods for Machine Learning

Ensemble Methods for Machine Learning Ensemble Methods for Machine Learning COMBINING CLASSIFIERS: ENSEMBLE APPROACHES Common Ensemble classifiers Bagging/Random Forests Bucket of models Stacking Boosting Ensemble classifiers we ve studied

More information

Statistical Consulting Topics Classification and Regression Trees (CART)

Statistical Consulting Topics Classification and Regression Trees (CART) Statistical Consulting Topics Classification and Regression Trees (CART) Suppose the main goal in a data analysis is the prediction of a categorical variable outcome. Such as in the examples below. Given

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13

Boosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13 Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles. Léon Bottou COS 424 4/8/2010 Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,

More information

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 13. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 13 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Multi-Class Classification Mehryar Mohri - Introduction to Machine Learning page 2 Motivation

More information

Minimally-Sized Balanced Decomposition Schemes for Multi-Class Classification

Minimally-Sized Balanced Decomposition Schemes for Multi-Class Classification Minimally-Sized Balanced Decomposition Schemes for Multi-Class Classification Evgueni Smirnov, Matthijs Moed, Georgi Nalbantov, and Ida Sprinkhuizen-Kuyper Abstract Error-correcting output coding (ECOC)

More information