An ensemble uncertainty aware measure for directed hill climbing ensemble pruning


Ioannis Partalas · Grigorios Tsoumakas · Ioannis Vlahavas

Received: 12 February 2009 / Revised: 4 March 2010 / Accepted: 8 March 2010 / Published online: 19 May 2010
© The Author(s) 2010

Editor: H. Blockeel.

I. Partalas · G. Tsoumakas · I. Vlahavas
Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
partalas@csd.auth.gr, greg@csd.auth.gr, vlahavas@csd.auth.gr

Abstract This paper proposes a new measure for ensemble pruning via directed hill climbing, dubbed Uncertainty Weighted Accuracy (UWA), which takes into account the uncertainty of the decision of the current ensemble. Empirical results on 30 data sets show that using the proposed measure to prune a heterogeneous ensemble leads to significantly better accuracy results compared to state-of-the-art measures and other baseline methods, while keeping only a small fraction of the original models. Besides the evaluation measure, the paper also studies two other parameters of directed hill climbing ensemble pruning methods, the search direction and the evaluation dataset, with interesting conclusions on appropriate values.

Keywords Ensemble pruning · Ensemble selection · Ensemble methods

1 Introduction

Ensemble methods (Dietterich 2000) have been a very popular research topic during the last decade. Their success arises largely from the fact that they offer an appealing solution to several interesting learning problems of the past and the present, such as improving predictive performance, scaling inductive algorithms to large databases, learning from multiple physically distributed data sets and learning from concept-drifting data streams. Typically, ensemble methods comprise two phases: the production of multiple predictive models and their combination. Recent work (Banfield et al. 2005; Caruana et al. 2004;

Margineantu and Dietterich 1997; Giacinto et al. 2000; Fan et al. 2002; Martinez-Munoz and Suarez 2004, 2006; Partalas et al. 2008; Tsoumakas et al. 2005) has considered an additional intermediate phase that deals with the reduction of the ensemble size prior to combination. This phase is commonly called ensemble pruning, while other names include selective ensemble, ensemble thinning and ensemble selection.

Ensemble pruning is important for two reasons: efficiency and predictive performance. Having a very large number of models in an ensemble adds a lot of computational overhead. For example, decision tree models may have large memory requirements (Margineantu and Dietterich 1997) and lazy learning methods have a considerable computational cost during execution. The minimization of run-time overhead is crucial in certain applications, such as stream mining. In addition, when models are distributed over a network, the reduction of models leads to the reduction of the important cost of communication. Equally important is the second reason, predictive performance. An ensemble may consist of both high and low predictive performance models. The latter may negatively affect the overall performance of the ensemble. Pruning these models while maintaining a high diversity among the remaining members of the ensemble is typically considered a proper recipe for an effective ensemble.

The ensemble pruning problem can be posed as an optimization problem as follows: find the subset of the original ensemble that optimizes a measure indicative of its generalization performance (for example, accuracy on a separate validation set). As the number of subsets of an ensemble consisting of T models is 2^T − 1 (the empty set is not accounted for), exhaustive search becomes intractable for a moderate ensemble size. Several efficient methods that are based on a directed hill climbing search in the space of subsets report good predictive performance results (Banfield et al. 2005; Caruana et al. 2004; Margineantu and Dietterich 1997; Martinez-Munoz and Suarez 2004). These methods start with an empty (or full) initial ensemble and search the space of different ensembles by iteratively expanding (or contracting) the initial ensemble by a single model. The search is guided by an evaluation measure that is based on either the predictive performance or the diversity of the alternative subsets. The evaluation measure is the main component of a directed hill climbing algorithm and it differentiates the methods that fall into this category.

The primary contribution of this work is a new measure for directed hill climbing ensemble pruning (DHCEP) that takes into account the uncertainty of the decision of the current ensemble. Empirical results on 30 data sets show that using the proposed measure to prune a heterogeneous ensemble leads to significantly better accuracy results compared to state-of-the-art measures and other baseline methods, while keeping only a small fraction of the original models. In addition, it is shown that the proposed measure maintains its lead across a variety of pruning levels. The secondary contribution of this work is an empirical study of the main parameters (search direction, evaluation dataset, evaluation measure) of DHCEP methods, leading to interesting conclusions on suitable settings.

This paper extends our previous work (Partalas et al. 2008) in the following respects.
First of all, it empirically compares a variety of different combinations of values for the main parameters of DHCEP methods, instead of specific instantiations. This has led to interesting new conclusions, such as the fact that the proposed measure significantly outperforms its rivals when used in a forward search direction. The comparison is based on a much larger number of datasets. Finally, this paper includes an extended description of the proposed measure elaborating on its key issues. The remainder of this paper is structured as follows: Sect. 2 presents background information on ensemble methods. Section 3 includes an extensive introduction to DHCEP methods and their main parameters. Section 4 presents the proposed measure. Section 5 describes the setup of the empirical study and Sect. 6 presents and discusses the results. Finally, Sect. 7 concludes this work and poses future research directions.

2 Ensemble methods

2.1 Producing the models

An ensemble can be composed of either homogeneous or heterogeneous models. Homogeneous models derive from different executions of the same learning algorithm, by using different values for the parameters of the learning algorithm, injecting randomness into the learning algorithm, or through the manipulation of the training instances, the input attributes and the model outputs (Dietterich 2000). Two popular methods for producing homogeneous models are bagging (Breiman 1996) and boosting (Schapire 1990). Heterogeneous models derive from running different learning algorithms on the same dataset. Such models have different views about the data, as they make different assumptions about them. For example, a neural network is robust to noise in contrast to a k-nearest neighbor classifier. In the empirical evaluation part of this paper we focus on ensembles consisting of models produced by running different learning algorithms, each with a variety of different parameter settings.

2.2 Combining the models

A lot of different ideas and methods have been proposed in the past for the combination of classification models. The necessity for high classification performance in some critical domains (e.g. medical, financial, intrusion detection) has motivated researchers to explore methods that combine different classification algorithms in order to overcome the limitations of individual learning paradigms.

Unweighted and weighted voting are two of the simplest methods for combining not only heterogeneous but also homogeneous models. In voting, each model outputs a class value (or ranking, or probability distribution) and the class with the most votes (or the highest average ranking, or average probability) is the one proposed by the ensemble. In weighted voting, the classification models are not treated equally: each model is associated with a coefficient (weight), usually proportional to its classification accuracy. Let x be an instance and m_i, i = 1, ..., k, a set of models that output a probability distribution m_i(x, c_j) for each class c_j, j = 1, ..., n. The output of the (weighted) voting method y(x) for instance x is given by the following expression:

y(x) = arg max_{c_j} Σ_{i=1}^{k} w_i m_i(x, c_j),

where w_i is the weight of model i. In the simple case of unweighted voting, the weights are all equal to one, that is, w_i = 1, i = 1, ..., k.
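The voting rule above amounts to a weighted sum of the models' probability distributions followed by an argmax. The sketch below is a minimal illustration; each model is assumed to be a callable returning a probability vector over the n classes, an interface chosen for the example rather than dictated by the paper.

```python
import numpy as np

def weighted_vote(models, weights, x, n_classes):
    """Return the class index that maximizes sum_i w_i * m_i(x, c_j).

    Each entry of `models` is assumed to be a callable mapping an instance x
    to a length-n_classes probability vector (an illustrative interface,
    not tied to a specific library)."""
    scores = np.zeros(n_classes)
    for m, w in zip(models, weights):
        scores += w * np.asarray(m(x), dtype=float)  # w_i * m_i(x, c_j)
    return int(np.argmax(scores))

# Unweighted voting is the special case w_i = 1 for every model:
# weighted_vote(models, [1.0] * len(models), x, n_classes)
```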

3 Ensemble pruning via directed hill climbing

Hill climbing search greedily selects the next state to visit from the neighborhood of the current state. States, in our case, are the different subsets of the original ensemble H = {h_t, t = 1, 2, ..., T} of T models. The neighborhood of a subset of models S ⊆ H comprises those subsets that can be constructed by adding or removing one model from S. We focus on the directed version of hill climbing that traverses the search space from one end (empty set) to the other (complete ensemble). An example of the search space for an ensemble of four models is presented in Fig. 1.

Fig. 1 An example of the search space of DHCEP methods for an ensemble of 4 models

The key design parameters that differentiate one DHCEP method from the other are (Tsoumakas et al. 2009): (a) the direction of search, (b) the measure and dataset used for evaluating the different branches of the search, and (c) the amount of pruning. The following sections discuss the different options for instantiating these parameters and the particular choices of existing methods.

3.1 Direction of search

Based on the direction of search, we have two main categories of DHCEP methods: (a) forward selection and (b) backward elimination (see Fig. 1).

In forward selection, the current classifier subset S is initialized to the empty set. The algorithm continues by iteratively adding to S the classifier h_t ∈ H \ S that optimizes an evaluation function. In the past, this approach has been used in Fan et al. (2002), Martinez-Munoz and Suarez (2004), Caruana et al. (2004) and in the Reduced-Error Pruning with Backfitting (REPwB) method in Margineantu and Dietterich (1997).

In backward elimination, the current classifier subset S is initialized to the complete ensemble H and the algorithm continues by iteratively removing from S the classifier h_t ∈ S that optimizes an evaluation function. In the past, this approach has been used in the AID thinning and concurrency thinning algorithms (Banfield et al. 2005).

In both cases, the traversal requires the evaluation of T(T + 1)/2 subsets, leading to a time complexity of O(T^2 g(T, N)). The term g(T, N) concerns the complexity of the evaluation function, which is linear with respect to N and ranges from constant to quadratic with respect to T, as we shall see in the following sections.
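Both directions can be expressed as a single greedy loop over candidate subsets. The sketch below illustrates the generic DHCEP procedure described above rather than any specific published implementation; `evaluate` stands for one of the measures of Sect. 3.3 and is assumed to return a score to be maximized.

```python
def directed_hill_climbing(models, evaluate, pruning_set, forward=True):
    """Generic DHCEP loop: forward selection (forward=True) or backward
    elimination.  `evaluate(ensemble, pruning_set)` is a placeholder for any
    measure of Sect. 3.3 and is assumed to return a score to maximize.
    Returns every ensemble visited, so that the amount of pruning can be
    decided afterwards (Sect. 3.4)."""
    current = [] if forward else list(models)
    trajectory = [list(current)]
    steps = len(models) if forward else len(models) - 1
    for _ in range(steps):
        if forward:
            candidates = [m for m in models if m not in current]
            chosen = max(candidates,
                         key=lambda m: evaluate(current + [m], pruning_set))
            current = current + [chosen]
        else:
            # pick the model whose removal leaves the best-scoring ensemble
            chosen = max(current,
                         key=lambda m: evaluate([c for c in current if c is not m],
                                                pruning_set))
            current = [c for c in current if c is not chosen]
        trajectory.append(list(current))
    return trajectory
```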

3.2 Evaluation dataset

The evaluation function scores the candidate subsets of models according to an evaluation measure that is calculated based on the predictions of its models on a set of data, which will be called the pruning set. The role of the pruning set can be performed by the training set, a separate validation set, or even a set of naturally existing or artificially produced instances with unknown value for the target variable. The pruning set will be denoted as D = {(x_i, y_i), i = 1, 2, ..., N}, where x_i is a vector with feature values and y_i is the value of the target variable, which may be unknown.

In the past, the training set has been used for evaluation in Martinez-Munoz and Suarez (2004). This approach offers the benefit that plenty of data will be available for evaluation and training, but is susceptible to the danger of overfitting.

Withholding a part of the training set for evaluation has been used in the past in Caruana et al. (2004) and Banfield et al. (2005) and in the REPwB method in Margineantu and Dietterich (1997). This approach is less prone to overfitting, but reduces the amount of data available for training and evaluation compared to the previous approach. It sacrifices both the predictive performance of the ensemble's members and the quantity of the evaluation data for the sake of using unseen data in the evaluation. This method should probably be preferred over the previous one when there is an abundance of training data.

An alternative approach that has been used in Caruana et al. (2006) is based on k-fold cross-validation. For each fold, an ensemble is created using the remaining folds as the training set. The same fold is used as the pruning set for models and subensembles of this ensemble. Finally, the evaluations are averaged across all folds. This approach is less prone to overfitting, as the evaluation of models is based on data that were not used for their training, and at the same time the complete training dataset is used for evaluation. During testing, the above approach works as follows: the k models that were trained using the same procedure (same algorithm, same subset, etc.) form a cross-validated model. When the cross-validated model makes a prediction for an instance, it averages the predictions of the individual models. An alternative testing strategy that we suggest for the above approach is to train an additional single model from the complete training set and use this single model during testing.

3.3 Evaluation measure

The main component that differentiates DHCEP methods is the evaluation measure. Evaluation measures can be grouped into two major categories: those that are based on performance and those that are based on diversity.

3.3.1 Performance-based measures

The goal of performance-based measures is to find the model that maximizes the performance of the ensemble produced by adding (removing) a model to (from) the current ensemble. Their calculation depends on the method used for ensemble combination, which usually is voting. Accuracy was used as an evaluation measure in Margineantu and Dietterich (1997) and Fan et al. (2002), while Caruana et al. (2004) experimented with several metrics, including accuracy, root-mean-squared error, mean cross-entropy, lift, precision/recall break-even point, precision/recall F-score, average precision and ROC area. Another measure is benefit, which is based on a cost model and has been used in Fan et al. (2002).

The calculation of performance-based metrics requires the decision of the ensemble on all examples of the pruning set. Therefore, the complexity of these measures is O(|S| N).
However, this complexity can be optimized to O(N), if the predictions of the current ensemble are updated incrementally each time a classifier is added to/removed from it.
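One way to realize this incremental update is to precompute each classifier's hard predictions on the pruning set and maintain per-example vote counts for the current ensemble; adding or removing one classifier then touches each of the N examples exactly once. A minimal sketch under these assumptions (plain majority voting, integer class labels; the class and method names are ours):

```python
import numpy as np

class IncrementalVoteAccuracy:
    """Maintains per-example vote counts so that the accuracy of the
    ensemble after adding/removing one classifier costs O(N) instead of
    O(|S| N).  `all_preds[t]` holds the precomputed predictions (integer
    class indices) of classifier t on the N pruning-set examples; `y`
    holds the true labels."""

    def __init__(self, all_preds, y, n_classes):
        self.all_preds = np.asarray(all_preds)            # shape (T, N)
        self.y = np.asarray(y)                            # shape (N,)
        self.votes = np.zeros((len(y), n_classes), int)   # running vote counts

    def add(self, t):
        self.votes[np.arange(len(self.y)), self.all_preds[t]] += 1

    def remove(self, t):
        self.votes[np.arange(len(self.y)), self.all_preds[t]] -= 1

    def accuracy(self):
        # Ties are broken towards the lowest class index (a simplification).
        return float(np.mean(self.votes.argmax(axis=1) == self.y))
```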

3.3.2 Diversity-based measures

It is generally accepted that an ensemble should contain diverse models in order to achieve high predictive performance. However, there is no clear definition of diversity, nor a single measure to calculate it. In their interesting study, Kuncheva and Whitaker (2003) could not reach a solid conclusion on how to utilize diversity for the production of effective classifier ensembles. In a more recent theoretical and experimental study on diversity measures (Tang et al. 2006), the authors reached the conclusion that diversity cannot be explicitly used for guiding the process of directed hill climbing methods. Yet, certain approaches have reported promising results (Martinez-Munoz and Suarez 2004; Banfield et al. 2005).

One issue that is worth mentioning here is how to calculate the diversity during the search in the space of ensemble subsets. For simplicity we consider the case of forward selection only. Let S be the current ensemble and h_t ∈ H \ S a candidate classifier to add to the ensemble. One could compare the diversities of the subensembles S' = S ∪ {h_t} for all candidates h_t ∈ H \ S and select the ensemble with the highest diversity. Any pairwise or non-pairwise diversity measure can be used for this purpose (Kuncheva and Whitaker 2003). Pairwise measures calculate the diversity between two models. The diversity of an ensemble of models can be calculated as the mean pairwise diversity of all models in the ensemble. The time complexity of this process is typically O(|S|^2 N). Non-pairwise diversity measures can directly calculate the diversity of an ensemble of models and their time complexity is typically O(|S| N).

A straightforward optimization can be performed in the case of pairwise diversity measures. Instead of calculating the sum of the pairwise diversities for every pair of classifiers in each candidate ensemble S', one can simply calculate the sum of the pairwise diversities only for the pairs that include the candidate classifier h_t. The sum of the rest of the pairs is equal for all candidate ensembles. The same optimization can be achieved in backward elimination too. This reduces their time complexity to O(|S| N).

Several methods use a different approach to calculate diversity during the search. They use pairwise measures to compare the candidate classifier h_t with the current ensemble S, which is viewed as a single classifier that combines the decisions of its members with voting. This way they calculate the diversity between the current ensemble as a whole and the candidate classifier. Such an approach has time complexity O(|S| N), which can be optimized to O(N), if the predictions of the current ensemble are updated incrementally each time a classifier is added to/removed from it. However, these calculations do not take into account the decisions of individual models.

In the past, the widely known pairwise diversity measures disagreement, double fault, Kohavi-Wolpert variance, inter-rater agreement, generalized diversity and difficulty were used for DHCEP in Tang et al. (2006). Complementariness (Martinez-Munoz and Suarez 2004) and concurrency (Banfield et al. 2005) are two diversity measures designed specifically for ensemble pruning via directed hill climbing. We next introduce some additional notation to uniformly present these two methods.
We can distinguish four events concerning the decision of a classifier h and an ensemble of models S with respect to an example (x_i, y_i):

e_tf(h, S, x_i, y_i): h(x_i) = y_i ∧ S(x_i) ≠ y_i
e_ft(h, S, x_i, y_i): h(x_i) ≠ y_i ∧ S(x_i) = y_i
e_tt(h, S, x_i, y_i): h(x_i) = y_i ∧ S(x_i) = y_i
e_ff(h, S, x_i, y_i): h(x_i) ≠ y_i ∧ S(x_i) ≠ y_i

The complementariness of a model h with respect to an ensemble S and a pruning set D is calculated as follows:

COM_D(h, S) = Σ_{i=1}^{N} I(e_tf(h, S, x_i, y_i)),

where I(true) = 1 and I(false) = 0. The complementariness of a model with respect to an ensemble is actually the number of examples of D that are classified correctly by the model and incorrectly by the ensemble. A pruning algorithm that uses the above measure tries to add (remove) at each step the model that helps the current ensemble classify correctly the examples it gets wrong. Note that in the backward case the removed model is the one that minimizes the measure.

The concurrency of a model h with respect to an ensemble S and a pruning set D is calculated as follows:

CON_D(h, S) = Σ_{i=1}^{N} ( 2 I(e_tf(h, S, x_i, y_i)) + I(e_tt(h, S, x_i, y_i)) − 2 I(e_ff(h, S, x_i, y_i)) )

This measure is similar to complementariness, with the difference that it takes into account two extra events and weights them. No specific argument is given for this particular choice of events and weights.

The margin distance minimization method (Martinez-Munoz and Suarez 2004), also specifically designed for DHCEP, follows a different approach for calculating the diversity, which implicitly takes into account the decisions of individual models. For each classifier h_t an N-dimensional vector c_t is defined, where each element c_t(i) is equal to 1 if the t-th classifier classifies correctly example i of the pruning set, and −1 otherwise. The vector C_S of the ensemble S is the average of the individual vectors c_t, C_S = (1/|S|) Σ_{t=1}^{|S|} c_t. When S classifies correctly all the instances, the corresponding vector lies in the first quadrant of the N-dimensional hyperplane. The objective is to reduce the Euclidean distance, d(o, C_S), of the current ensemble C_S from a predefined vector with the same components, o_i = p, i = 1, ..., N, 0 < p < 1, placed in the first quadrant of the N-dimensional hyperplane. The value of p is usually small (around 0.05). The proposed margin diversity measure, MAR_D(h_t, S), of a classifier h_t with respect to an ensemble S and a pruning set D is calculated as follows:

MAR_D(h_t, S) = d( o, (1/(|S| + 1)) (c_t + C_S) )
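For concreteness, the three measures can be sketched as follows, assuming that the correctness of the candidate classifier and of the current ensemble on the pruning set has been precomputed (as boolean vectors, and as ±1 signature vectors for the margin-based measure). This is an illustrative reimplementation, not the original authors' code.

```python
import numpy as np

def _as_bool(a):
    return np.asarray(a, dtype=bool)

def complementariness(h_correct, S_correct):
    """COM_D(h, S): number of pruning-set examples classified correctly by
    the candidate h and incorrectly by the ensemble S (event e_tf)."""
    h, s = _as_bool(h_correct), _as_bool(S_correct)
    return int(np.sum(h & ~s))

def concurrency(h_correct, S_correct):
    """CON_D(h, S): weighted count of the events e_tf, e_tt and e_ff."""
    h, s = _as_bool(h_correct), _as_bool(S_correct)
    e_tf, e_tt, e_ff = h & ~s, h & s, ~h & ~s
    return int(np.sum(2 * e_tf + e_tt - 2 * e_ff))

def margin_distance(c_t, C_S, size_S, p=0.05):
    """MAR_D(h_t, S) as reconstructed above: Euclidean distance of
    (c_t + C_S) / (|S| + 1) from the reference point o = (p, ..., p).
    c_t holds +1/-1 correctness entries of the candidate; C_S is the
    average signature vector of the current ensemble of size |S|."""
    c_t, C_S = np.asarray(c_t, float), np.asarray(C_S, float)
    o = np.full(len(c_t), p)
    return float(np.linalg.norm(o - (c_t + C_S) / (size_S + 1)))
```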

3.4 Amount of pruning

Another issue that pertains to almost all ensemble pruning methods concerns the size of the final ensemble. Two main approaches are followed with respect to this issue: (a) use a fixed user-specified amount or percentage of models, and (b) dynamically select the size based on the predictive performance of candidate ensembles of different size. In the second case, the predictive performance of the ensembles encountered during the complete search process from the one end of the search space to the other is recorded, and the ensemble with the best performance is selected. If the goal of pruning is to improve efficiency, then the former approach can be used in order to achieve the desired number of models, which may be dictated by constraints (memory and speed) in the application environment. If the goal of pruning is to improve performance, then the latter approach can be used, as it is more flexible and can sacrifice efficiency for effectiveness.

4 The proposed measure

We here propose a new measure, starting from the common notation that was introduced in Sect. 3.3 to describe the complementariness and concurrency measures. For simplicity of presentation we slightly abuse the notation by dropping the symbols x_i and y_i.

First of all, we note that complementariness is based on event e_tf(h, S) only. However, the rest of the events are also plausible indicators of the utility of a candidate classifier h with respect to an ensemble S. For example, event e_tt(h, S) should also add to the utility of h, though potentially not as much as e_tf(h, S). This is reflected in the concurrency measure, which explicitly takes into account three of the events (implicitly the fourth one as well, since e_ft(h, S) = 1 − e_tt(h, S) − e_ff(h, S) − e_tf(h, S)), each with a different weight. Concurrency considers positively the event e_tf(h, S), negatively the event e_ff(h, S), positively with half the weight of the previous events the event e_tt(h, S), and neutrally the event e_ft(h, S). The contribution of each of the events to the heuristic is rather ad hoc, as there is no theoretical justification for the particular choice of weights.

More important is the fact that, as briefly mentioned in the previous section, concurrency, complementariness and all other diversity measures that are computed based on the decision of the candidate model and the decision of the ensemble as a whole are agnostic of the decisions of the individual models of the ensemble. We consider this a major limitation of these measures, and justify our claim with an example. Consider two members of the pruning set, (x_i, y_i) and (x_j, y_j), that are wrongly classified by candidate classifier h. The first one is wrongly classified by 49% of the members of the ensemble S, while the second one by just 10%. In both cases event e_ft(h, S) is the true one. Should these examples contribute the same value to the measure? The rational answer is no. In the first case, the uncertainty of the ensemble is high, while in the second it is low. In a forward selection scenario, the probability that the ensemble will wrongly classify example i in the future, if it adds h to S, is far greater compared to the probability that it will wrongly classify j.

The above issues motivated us to propose a new measure that takes into account the uncertainty of the ensemble's decision, and at the same time has clear and justified semantics. The following quantities are introduced to allow its definition: NT_i, which denotes the proportion of models in the current ensemble S that classify example (x_i, y_i) correctly, and NF_i = 1 − NT_i, which denotes the proportion of models in S that classify it incorrectly. The proposed measure, dubbed Uncertainty Weighted Accuracy (UWA), is defined as follows:

UWA(h, S) = Σ_{i=1}^{N} ( I(e_tf(h, S, x_i, y_i)) NT_i − I(e_ft(h, S, x_i, y_i)) NF_i + I(e_tt(h, S, x_i, y_i)) NF_i − I(e_ff(h, S, x_i, y_i)) NT_i )

First of all, note that events e_tf and e_tt increase the metric, because the candidate classifier is correct, while events e_ft and e_ff decrease it, as the candidate classifier is incorrect. The strength of the increase/decrease depends on the uncertainty of the ensemble's decision. If the current ensemble S is incorrect, then the reward/penalty is multiplied by the proportion of correct models in S.
On the other hand, if S is correct, then the reward/penalty is multiplied by the proportion of incorrect models in S.
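The definition translates directly into code. The sketch below assumes boolean correctness vectors for the candidate and the current ensemble, plus the vector of NT_i proportions; the interface is illustrative, not the authors' implementation.

```python
import numpy as np

def uwa(h_correct, S_correct, NT):
    """Uncertainty Weighted Accuracy of candidate h with respect to ensemble S.

    h_correct[i], S_correct[i]: whether the candidate / the current ensemble
    classifies pruning-set example i correctly; NT[i]: proportion of S's
    members that are correct on example i (NF = 1 - NT)."""
    h = np.asarray(h_correct, bool)
    s = np.asarray(S_correct, bool)
    NT = np.asarray(NT, float)
    NF = 1.0 - NT
    e_tf = h & ~s
    e_ft = ~h & s
    e_tt = h & s
    e_ff = ~h & ~s
    return float(np.sum(e_tf * NT - e_ft * NF + e_tt * NF - e_ff * NT))
```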

This complex, at first sight, weighting scheme actually represents a simple rule: examples for which the ensemble's decision is highly uncertain should influence the metric strongly, while examples where most of the ensemble's members agree should not influence the metric a lot. The rationale of this rule is the following: when most members of the ensemble agree, then this is either a very easy (if the ensemble is correct) or a very hard (if it is wrong) example. Rewarding a candidate classifier that correctly classifies an easy example is of no real value, as is penalizing it for erring on a very hard example. On the other hand, when the ensemble is marginally correct or incorrect, then the decision of the candidate classifier is more important, as it may correct an incorrect decision of the ensemble, or the other way round.

To see exactly how the measure represents the above rule, let us examine each specific event separately, in a forward selection scenario.

In event e_tf, the addition of a correct classifier when the ensemble is wrong contributes a reward proportional to the number of correct classifiers in that ensemble. The rationale is that if the number of correct classifiers is small, then correct classification of this example is hard to achieve and thus the addition of this classifier will not have an important impact on the ensemble's performance. On the other hand, if the number of correct classifiers is large, then the example is marginal and thus the impact of the addition of this classifier is significant.

In event e_ft, the addition of an erring classifier when the ensemble is correct contributes a penalty proportional to the number of erring classifiers in that ensemble. The rationale is that if the number of erring classifiers is small, then the addition of another erring classifier will not influence the ensemble significantly, while if the number of erring classifiers is large, then the example is marginal and thus the classifier could change the ensemble's decision from correct to wrong.

In event e_tt, the addition of a correct classifier when the ensemble is correct contributes a reward proportional to the number of erring classifiers in that ensemble. The rationale is that if the number of erring classifiers is small, then the addition of a correct classifier is not really very useful. The higher the number of erring classifiers, the more important the addition of this candidate classifier is.

In event e_ff, the addition of an erring classifier when the ensemble is wrong contributes a penalty proportional to the number of correct classifiers in that ensemble. The rationale is that if the number of correct classifiers is small, then the addition of another erring classifier will not influence the ensemble significantly, as this is a hard-to-classify example. If the number of correct classifiers is large, then the example is marginal and thus the classifier has a negative effect on the ensemble, as it moves it further away from the margin.

We next give a concrete example, in a forward selection scenario, in order to clarify how the proposed measure works. Consider an ensemble that contains 10 classifiers, h_1 to h_10, and two candidate classifiers, h'_1 and h'_2. Tables 1 and 2 show whether each of these classifiers is correct or incorrect for a pruning set containing 5 examples, x_1 to x_5. A correct decision is indicated by a plus (+) and a wrong decision by a minus (−).
For the ensemble, example x_1 is a difficult one and x_2 an easy one, while the rest of the examples are marginal. The last column of Table 2 shows the value of the proposed measure. The events that characterize h'_1 in relation to examples x_1 to x_5 are e_tf, e_ft, e_tt, e_ft and e_ff respectively. Consequently, the contributions to UWA are NT_1 − NF_2 + NF_3 − NF_4 − NT_5, giving −0.4. The reward from correctly classifying x_1 is small, as it is a very hard example. The classifier is penalized more strongly for the incorrect classification of the marginal examples x_4 and x_5 than for the easy example x_2.

Table 1 An ensemble of 10 classifiers and the decisions of its models for the 5 instances of the pruning set, together with the resulting NT and NF values

Table 2 Two candidate classifiers, their decisions for the 5 instances of the pruning set and the value of the proposed measure

The events that characterize h'_2 in relation to examples x_1 to x_5 are e_ff, e_ft, e_ft, e_tt and e_tf respectively. Consequently, the contributions to UWA are −NT_1 − NF_2 − NF_3 + NF_4 + NT_5, giving 0. The reward from correctly classifying the marginal examples x_4 and x_5 is large, as is the penalty from the incorrect classification of the marginal example x_3. In contrast, the incorrect classification of the very hard and very easy examples x_1 and x_2 incurs a small penalty. Although both candidate classifiers make the same number of errors, overall h'_2 is more beneficial to the current ensemble, which is reflected in a higher value of UWA. It is this candidate classifier that would be added to the ensemble in the forward selection scenario that we examined.
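Since the numeric entries of Tables 1 and 2 are not reproduced above, the snippet below recomputes the two UWA values from hypothetical NT_i proportions that merely respect the qualitative description (x_1 hard, x_2 easy, the rest marginal) and the stated events; the actual table values may differ.

```python
# Hypothetical NT_i values (NOT the original table entries), chosen only to
# match the qualitative description: x_1 hard, x_2 easy, x_3-x_5 marginal.
NT = {"x1": 0.1, "x2": 0.8, "x3": 0.6, "x4": 0.7, "x5": 0.4}
NF = {k: 1.0 - v for k, v in NT.items()}

# h'_1: events e_tf, e_ft, e_tt, e_ft, e_ff on x_1..x_5
uwa_h1 = NT["x1"] - NF["x2"] + NF["x3"] - NF["x4"] - NT["x5"]
# h'_2: events e_ff, e_ft, e_ft, e_tt, e_tf on x_1..x_5
uwa_h2 = -NT["x1"] - NF["x2"] - NF["x3"] + NF["x4"] + NT["x5"]
# uwa_h1 == -0.4 and uwa_h2 == 0.0 (up to floating point), so h'_2 is preferred.
```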

Concluding this section, we explain how the name of the proposed measure was conceived. If we dropped the weights from the measure, then it would be proportional to the accuracy of the candidate classifier, since it would give a reward (penalty) of 1 whenever there is a correct (incorrect) candidate classifier decision, irrespectively of the ensemble's decision. The weights correspond to the ensemble's uncertainty, hence Uncertainty Weighted Accuracy.

5 Experimental setup

This section presents information about the datasets that were used for conducting the experiments, the classifiers that comprise the initial ensemble, the ensemble pruning methods that participate in the comparison, as well as the experimentation methodology that was followed. The source code we developed for conducting the experiments, along with a package for performing statistical tests on multiple datasets, can be found at the following URL:

5.1 Datasets

We experimented on 30 data sets from the UCI Machine Learning repository (Asuncion and Newman 2007). Table 3 presents the details of these data sets (folder in the UCI server, number of instances, classes, continuous and discrete attributes, percentage of missing values). We avoided using datasets with a very small number of examples, so that an adequate amount of data is available for training, evaluation and testing.

Table 3 Details of data sets: folder in UCI server, number of instances, classes, continuous and discrete attributes, percentage of missing values. The 30 data sets (d1 to d30) are: Anneal, Balance-scale, Breast-w, Car, Cmc, Colic, Credit-g, Dermatology, Ecoli, Haberman, Heart-h, Heart-statlog, Hill, Hypothyroid, Ionosphere, Kr-vs-kp, Mammographic, Mfeat-morphological, Page-blocks, Primary-tumor, Segment, Sick, Sonar, Soybean, Spambase, Tic-tac-toe, Vehicle, Vote, Vowel, Waveform

5.2 Ensemble construction

We constructed a heterogeneous ensemble of 200 models, by running different learning algorithms with different parameters on the training set. The WEKA machine learning library (Witten and Frank 2005) was used as the source of learning algorithms. We trained 40 multilayer perceptrons (MLPs), 60 k-nearest neighbor classifiers (kNNs), 80 support vector machines (SVMs) and 20 decision trees (DTs) using the C4.5 algorithm. The different parameters used to train the algorithms were the following (default values were used for the rest of the parameters):

MLPs: we used 5 values for the nodes in the hidden layer {1, 2, 4, 8, 16}, 4 values for the momentum term {0.0, 0.2, 0.5, 0.9} and 2 values for the learning rate {0.3, 0.6}.

kNNs: we used 20 values for k distributed evenly between 1 and the plurality of the training instances. We also used 3 weighting methods: no weighting, inverse weighting and similarity weighting.

SVMs: we used 8 values for the complexity parameter {10^-5, 10^-4, 10^-3, 10^-2, 0.1, 1, 10, 100} and 10 different kernels. We used 2 polynomial kernels (of degree 2 and 3) and 8 radial kernels (gamma ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2}).

Decision trees: we constructed 10 trees using postpruning with 5 values for the confidence factor {0.1, 0.2, 0.3, 0.4, 0.5} and 2 values for Laplace smoothing {true, false}, 8 trees using reduced error pruning with 4 values for the number of folds {2, 3, 4, 5} and 2 values for Laplace smoothing {true, false}, and 2 unpruned trees using 2 values for the minimum objects per leaf {2, 3}.
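A pool of this kind can be rebuilt in spirit with any library offering these four families of learners. The sketch below uses scikit-learn classes as stand-ins for the WEKA implementations used in the paper; the mapping is only approximate (for instance, C4.5's confidence factor and kNN similarity weighting have no direct equivalent) and the grids are abbreviated rather than the full 200-model configuration.

```python
from itertools import product
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_pool():
    """Abbreviated heterogeneous pool in the spirit of Sect. 5.2."""
    pool = []
    # MLPs: hidden nodes x momentum x learning rate
    for h, mom, lr in product([1, 2, 4, 8, 16], [0.0, 0.2, 0.5, 0.9], [0.3, 0.6]):
        pool.append(MLPClassifier(hidden_layer_sizes=(h,), solver="sgd",
                                  momentum=mom, learning_rate_init=lr))
    # kNNs: a few k values x two weighting schemes
    for k, w in product([1, 5, 11, 21, 51], ["uniform", "distance"]):
        pool.append(KNeighborsClassifier(n_neighbors=k, weights=w))
    # SVMs: complexity parameter x radial kernels, plus two polynomial kernels
    for c, g in product([1e-5, 1e-3, 0.1, 1, 10, 100],
                        [0.001, 0.01, 0.1, 0.5, 1, 2]):
        pool.append(SVC(C=c, kernel="rbf", gamma=g, probability=True))
    for d in [2, 3]:
        pool.append(SVC(C=1.0, kernel="poly", degree=d, probability=True))
    # Decision trees: two settings for the minimum objects per leaf
    for leaf in [2, 3]:
        pool.append(DecisionTreeClassifier(min_samples_leaf=leaf))
    return pool
```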
5.3 Ensemble pruning methods

We compare 16 instantiations of the general DHCEP algorithm that arise by using all combinations of 4 different values for the evaluation measure parameter and 2 different values for each of the evaluation dataset and search direction parameters. As far as the evaluation measure parameter is concerned, we used the proposed measure, UWA, and the following performance- and diversity-based measures: Accuracy (ACC) (Caruana et al. 2004), Complementariness (COM) (Martinez-Munoz and Suarez 2004) and Concurrency Thinning (CON) (Banfield et al. 2005). The search was performed in both the forward (F) and the backward (B) directions. The pruning set was instantiated to: (a) the training set (T), which means that all available data are used for both building the models and pruning the ensemble, and (b) a separate validation set (V).

In addition to the above 16 methods, we implemented two methods that do not use any diversity measures and prune the ensemble according to a fixed order. The first one is called Random Ordering (RO), which randomly orders the classifiers, and the second one is called Greedy Ordering (GO), which orders the classifiers according to their accuracy on the pruning set. Additionally, we implemented two baseline methods, corresponding to two extreme pruning scenarios. The first one selects the best single model (BSM) in the ensemble, according to the performance of the models on the pruning set, while the second one retains all models of the ensemble (ALL). Note that for RO, GO, BSM and ALL, the direction parameter has no meaning and is neglected. Additionally, note that ALL does not require a pruning set, as it does not include any selection process. For this reason, the performance of ALL is calculated only for the case where all available training data are used for training the models of the ensemble.

Voting was used for model combination in all aforementioned algorithms. Additionally, all the algorithms follow the dynamic approach in Caruana et al. (2004), which selects the

ensemble, during the pruning procedure, with the highest accuracy on the pruning set, instead of using an arbitrary percentage of models. Table 4 shows the acronyms that will be used in the rest of this paper for the different instantiations of the parameters of the general DHCEP algorithm, RO, GO and BSM.

Table 4 Acronym, search direction, evaluation dataset and evaluation measure for the different DHCEP, ordering and baseline pruning methods

Acronym  Search direction  Evaluation dataset  Evaluation measure
FTACC    Forward           Training set        ACCuracy
FTCON    Forward           Training set        CONcurrency
FTCOM    Forward           Training set        COMplementariness
FTUWA    Forward           Training set        Uncertainty Weighted Accuracy
FVACC    Forward           Validation set      ACCuracy
FVCON    Forward           Validation set      CONcurrency
FVCOM    Forward           Validation set      COMplementariness
FVUWA    Forward           Validation set      Uncertainty Weighted Accuracy
BTACC    Backward          Training set        ACCuracy
BTCON    Backward          Training set        CONcurrency
BTCOM    Backward          Training set        COMplementariness
BTUWA    Backward          Training set        Uncertainty Weighted Accuracy
BVACC    Backward          Validation set      ACCuracy
BVCON    Backward          Validation set      CONcurrency
BVCOM    Backward          Validation set      COMplementariness
BVUWA    Backward          Validation set      Uncertainty Weighted Accuracy
TGO      -                 Training set        -
TRO      -                 Training set        -
TBSM     -                 Training set        -
VGO      -                 Validation set      -
VRO      -                 Validation set      -
VBSM     -                 Validation set      -

5.4 Methodology

Initially, each dataset is split into three disjoint parts: D_1, D_2 and D_3, consisting of 60%, 20% and 20% of the examples respectively. In the case where a separate validation set is used for pruning, D_1 is used for training the models and D_2 for performing the pruning procedure. In the other case, D_1 ∪ D_2 is used for both training and pruning. D_3 is always used solely for testing the methods. The experiment described in this section is performed 10 times for each dataset using a different randomized ordering of its examples. All reported results are averages over these 10 repetitions.
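The splitting protocol above amounts to ten random 60/20/20 partitions per dataset. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def split_indices(n, seed):
    """One repetition of the protocol of Sect. 5.4: shuffle the n examples
    and split them into D1 (60%, training), D2 (20%, pruning/validation)
    and D3 (20%, testing)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n1, n2 = int(0.6 * n), int(0.8 * n)
    return idx[:n1], idx[n1:n2], idx[n2:]

# The experiment is repeated 10 times with different random orderings:
# splits = [split_indices(n_examples, seed) for seed in range(10)]
```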

6 Results and discussion

According to Demsar (2006), the appropriate way to compare the effectiveness of multiple algorithms on multiple datasets is based on their average rank across all datasets. On each dataset, the algorithm with the best performance gets rank 1.0, the one with the second best performance gets rank 2.0 and so on. In case two or more algorithms tie, they all receive the average of the ranks that correspond to them. Table 5 shows the average rank (based on classification accuracy) along with the average size and composition (type of models) of the pruned ensemble, for each method participating in the experiments. DHCEP methods are grouped first by pruning set, then by search direction and finally sorted by evaluation measure. The table continues with RO, GO and BSM grouped by pruning set, and ends with the ALL method. Detailed tables with the classification accuracy, rank and ensemble size of all methods on all datasets can be found in the Appendix (Tables 6-9).

Table 5 Average rank, ensemble size and type (MLP, kNN, SVM, DT) of the selected models of each method across all datasets

We first study the evaluation dataset parameter and its relation to the effectiveness of the ensemble pruning methods. We observe that using a separate validation set leads to better results than using the training set, for all methods. In order to investigate whether the differences in accuracy are statistically significant, we conducted 11 Wilcoxon signed rank tests (Wilcoxon 1945), one for each pair of methods that differ in terms of the evaluation dataset. At a confidence level of 95%, all the tests reported significant differences in favor of using a separate validation set. This shows that using unseen data to guide the pruning process is very important, despite the fact that it comes at the expense of available training data for the models of the ensemble.
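The ranking scheme and the pairwise significance tests used throughout this section can be computed with standard tools. The sketch below (NumPy/SciPy) is illustrative rather than the exact script behind the reported numbers:

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

def average_ranks(acc):
    """acc: array of shape (n_datasets, n_methods) with accuracies.
    Rank 1.0 goes to the best method on each dataset; ties receive the
    average of the tied ranks (Demsar 2006)."""
    acc = np.asarray(acc, float)
    ranks = np.vstack([rankdata(-row) for row in acc])  # higher accuracy -> lower rank
    return ranks.mean(axis=0)

def paired_wilcoxon(acc_a, acc_b, alpha=0.05):
    """Wilcoxon signed rank test on the per-dataset accuracies of two methods."""
    stat, p = wilcoxon(acc_a, acc_b)
    return p < alpha, p
```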

When the training set is used as the pruning set, the pruning process is based on model predictions for data that are known to the models, and is therefore biased towards model subsets that overfit the training data. The negative effect of overfitting is stronger for methods that select a small number of models, such as the baseline BSM method and the DHCEP methods that search in the forward direction.

We next study the effect of the search direction parameter. In light of the results concerning the evaluation dataset, we exclude methods that use the training set from this discussion. We first notice that forward selection leads to better results compared to backward elimination for all evaluation measures. In order to investigate whether the differences in accuracy are statistically significant, we conducted 4 Wilcoxon signed rank tests, one for each pair of DHCEP methods that use a separate validation set for pruning and differ in terms of the search direction. At a confidence level of 95%, all tests show significant differences in favor of the forward direction, apart from the pair of BVCON and FVCON. As far as the size of the pruned ensemble is concerned, we observe that those DHCEP methods that search in the forward direction tend to produce much smaller ensembles than those that search in the backward direction. More specifically, the FVxxx methods achieve an average reduction of 96.33% of the initial ensemble, while the BVxxx ones achieve an average reduction of 81.82%. Note that this trend also holds for DHCEP methods that use the training set to guide the pruning process.

Based on these results, the recommended search direction for DHCEP methods is the forward one. Even though it does not bring a statistically significant benefit for one of the measures (CON), it manages to reduce the initial ensemble substantially more compared to the backward direction. Note that this conclusion assumes the use of a separate validation set for evaluation. This is an interesting outcome and can be explained by considering the fact that DHCEP methods are based on a comparison of the decisions of a candidate model and an ensemble of models. In the backward direction, DHCEP methods remove those models that are not helpful compared to the current ensemble according to an evaluation measure. However, the current ensemble is initially suboptimal, as it contains all models, both good and bad ones. Therefore, there is a high probability that a good model will be removed in the beginning, just because it does not improve the current suboptimal ensemble. On the other hand, in the forward direction the initial ensemble is empty and is progressively expanded according to the evaluation measure, so it always contains good models. This explains both the higher performance and the smaller size of the DHCEP methods that search in the forward direction.

We continue the analysis of the results with a study of the evaluation measure parameter. As before, we exclude methods that use the training set from this discussion. We first notice that the proposed measure, UWA, leads to the two best overall results, independently of the search direction. We next proceed to an investigation of whether the proposed measure is significantly better compared to the rest. Taking into account the conclusion concerning the search direction parameter, we performed a Wilcoxon signed rank test between FVUWA and each of FVCON, FVCOM and FVACC. We also performed a Wilcoxon signed rank test between FVUWA and each of VGO, VRO, VBSM and ALL. At a confidence level of 95%, all the tests show that FVUWA performs significantly better than its competitors.
This shows that taking into account the uncertainty of the ensemble's decision can lead to substantially better results in terms of accuracy. The size of the pruned ensemble is approximately the same as that returned by DHCEP methods using other evaluation measures.

It is also interesting to look at the composition of the pruned ensembles, as shown in the last four columns of Table 5. Focusing on DHCEP methods that search in the forward direction and use a separate validation set for evaluation, we notice that they produce ensembles with a balanced mixture of different types of models. Different types of models

lead to more diverse ensembles and, as a result, more accurate ensembles. It is also interesting to look into how the composition of models affects the performance of these 4 methods. It seems that more SVMs and MLPs and fewer DTs and kNN models lead to better results. MLPs and SVMs are high performance classifiers and thus they dominate the ensemble. Additionally, the selection of DTs and kNNs seems to add to the overall diversity, and such models should be present in the pruned ensemble.

So far, we have looked at how methods behave under the setting that the percentage of pruning is automatically determined by the methods themselves, which as we have discussed in Sect. 3 is suitable when the main motivation of the pruning process is to increase the prediction effectiveness. However, if we were primarily interested in assessing the efficiency of the methods, then we should evaluate their accuracy under different percentages of pruning. Still, we should note that the automated pruning methods already achieve a remarkable pruning level, thus they meet the goal of efficiency too.

Fig. 2 Performance of the FVxxx methods during the pruning phase. The ranking procedure is performed for the different sizes of the ensemble between 1 and 200

Figure 2 depicts the average rank of the FVxxx methods during the pruning process. The curves are produced by calculating the ranks for the different sizes of the ensemble between 1 and 200. It is interesting to note that the method using the proposed measure (FVUWA) is the best one for the largest part of the graph. Especially in the area between 2 and 40 models, it clearly outperforms its rivals. These findings show that it can be used for guiding a pruning process in situations where computational and storage savings are the dominant motivation. Another interesting observation is that although FVCOM has a better average rank compared to FVACC based on the automatic selection of the amount of pruning, it is much worse than FVACC under the same level of pruning for the largest part of the graph.
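Curves such as those summarized in Fig. 2 only require recording the performance of every intermediate ensemble on the trajectory of a DHCEP run (for example, the trajectory returned by the hill-climbing sketch after Sect. 3.1). The helper below is an illustrative sketch of such bookkeeping, not part of the original experimental code:

```python
def accuracy_per_size(trajectory, ensemble_accuracy, data):
    """Pairs each non-empty intermediate ensemble of a DHCEP trajectory with
    its accuracy on `data`; this is the raw material for rank-versus-size
    curves.  `ensemble_accuracy(S, data)` is assumed to compute the voted
    accuracy of ensemble S."""
    return [(len(S), ensemble_accuracy(S, data)) for S in trajectory if S]
```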

7 Conclusions and future work

This paper presented a new measure for DHCEP, called Uncertainty Weighted Accuracy (UWA), which takes into account the uncertainty of the decisions of the current ensemble. We compared UWA against state-of-the-art measures using different values for the parameters of search direction and evaluation dataset on ensembles of heterogeneous models. The empirical comparison was carried out on 30 datasets and included 4 additional baseline ensemble pruning methods. The results show that the proposed measure leads to significantly better accuracy results compared to its rivals and that it also manages to reduce substantially the size of the original ensemble and thus to minimize the computational complexity.

Several interesting conclusions came up. To begin with, the evaluation dataset parameter plays an important role in the performance of DHCEP methods. The use of a separate validation set leads to significantly better results than using all the available data. Also, the direction of search is another important parameter that influences the performance of DHCEP methods. The results showed that the forward direction helps to significantly improve the accuracy.

One interesting future work direction concerns the composition of the pool of models that constitute the initial ensemble. It is interesting to investigate which parameters of the algorithms have proved effective or not (for example, a neural network with 10 hidden nodes) and to use this information in order to substitute the specific models with other, more effective ones. Another related interesting direction concerns the development of a method that can train and add models to the ensemble incrementally in an active learning fashion; in other words, to grow the ensemble dynamically by actively selecting the next most appropriate model to train and/or perhaps dynamically removing inappropriate models that were previously added. This will save the computational cost of training a large number of models from the beginning.

Acknowledgements We would like to sincerely thank the anonymous reviewers for their helpful and to-the-point comments.


Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014 Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

On Multi-Class Cost-Sensitive Learning

On Multi-Class Cost-Sensitive Learning On Multi-Class Cost-Sensitive Learning Zhi-Hua Zhou, Xu-Ying Liu National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China {zhouzh, liuxy}@lamda.nju.edu.cn Abstract

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Predicting the Probability of Correct Classification

Predicting the Probability of Correct Classification Predicting the Probability of Correct Classification Gregory Z. Grudic Department of Computer Science University of Colorado, Boulder grudic@cs.colorado.edu Abstract We propose a formulation for binary

More information

Voting (Ensemble Methods)

Voting (Ensemble Methods) 1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Name (NetID): (1 Point)

Name (NetID): (1 Point) CS446: Machine Learning (D) Spring 2017 March 16 th, 2017 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

More information

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University Decision Tree Learning Mitchell, Chapter 3 CptS 570 Machine Learning School of EECS Washington State University Outline Decision tree representation ID3 learning algorithm Entropy and information gain

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

VBM683 Machine Learning

VBM683 Machine Learning VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

A Simple Algorithm for Learning Stable Machines

A Simple Algorithm for Learning Stable Machines A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is

More information

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology

Decision trees. Special Course in Computer and Information Science II. Adam Gyenge Helsinki University of Technology Decision trees Special Course in Computer and Information Science II Adam Gyenge Helsinki University of Technology 6.2.2008 Introduction Outline: Definition of decision trees ID3 Pruning methods Bibliography:

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks. Nicholas Ruozzi University of Texas at Dallas Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify

More information

Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Abstract. 1 Introduction. Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand

Abstract. 1 Introduction. Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand Í Ò È ÖÑÙØ Ø ÓÒ Ì Ø ÓÖ ØØÖ ÙØ Ë Ð Ø ÓÒ Ò ÓÒ ÌÖ Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand eibe@cs.waikato.ac.nz Ian H. Witten Department of Computer Science University

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

DRAFT. Ensemble Pruning. 6.1 What is Ensemble Pruning

DRAFT. Ensemble Pruning. 6.1 What is Ensemble Pruning 6 Ensemble Pruning 6.1 What is Ensemble Pruning Given a set of trained individual learners, rather than combining all of them, ensemble pruning tries to select a subset of individual learners to comprise

More information

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order

ML techniques. symbolic techniques different types of representation value attribute representation representation of the first order MACHINE LEARNING Definition 1: Learning is constructing or modifying representations of what is being experienced [Michalski 1986], p. 10 Definition 2: Learning denotes changes in the system That are adaptive

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

Gradient Boosting (Continued)

Gradient Boosting (Continued) Gradient Boosting (Continued) David Rosenberg New York University April 4, 2016 David Rosenberg (New York University) DS-GA 1003 April 4, 2016 1 / 31 Boosting Fits an Additive Model Boosting Fits an Additive

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Model Averaging With Holdout Estimation of the Posterior Distribution

Model Averaging With Holdout Estimation of the Posterior Distribution Model Averaging With Holdout stimation of the Posterior Distribution Alexandre Lacoste alexandre.lacoste.1@ulaval.ca François Laviolette francois.laviolette@ift.ulaval.ca Mario Marchand mario.marchand@ift.ulaval.ca

More information

DESIGNING RBF CLASSIFIERS FOR WEIGHTED BOOSTING

DESIGNING RBF CLASSIFIERS FOR WEIGHTED BOOSTING DESIGNING RBF CLASSIFIERS FOR WEIGHTED BOOSTING Vanessa Gómez-Verdejo, Jerónimo Arenas-García, Manuel Ortega-Moral and Aníbal R. Figueiras-Vidal Department of Signal Theory and Communications Universidad

More information

Regularized Linear Models in Stacked Generalization

Regularized Linear Models in Stacked Generalization Regularized Linear Models in Stacked Generalization Sam Reid and Greg Grudic University of Colorado at Boulder, Boulder CO 80309-0430, USA Abstract Stacked generalization is a flexible method for multiple

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology SE lecture revision 2013 Outline 1. Bayesian classification

More information

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

Logistic Regression and Boosting for Labeled Bags of Instances

Logistic Regression and Boosting for Labeled Bags of Instances Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In

More information

Tutorial 6. By:Aashmeet Kalra

Tutorial 6. By:Aashmeet Kalra Tutorial 6 By:Aashmeet Kalra AGENDA Candidate Elimination Algorithm Example Demo of Candidate Elimination Algorithm Decision Trees Example Demo of Decision Trees Concept and Concept Learning A Concept

More information

Statistics and learning: Big Data

Statistics and learning: Big Data Statistics and learning: Big Data Learning Decision Trees and an Introduction to Boosting Sébastien Gadat Toulouse School of Economics February 2017 S. Gadat (TSE) SAD 2013 1 / 30 Keywords Decision trees

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

COMS 4771 Lecture Boosting 1 / 16

COMS 4771 Lecture Boosting 1 / 16 COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

More information

Classification and Prediction

Classification and Prediction Classification Classification and Prediction Classification: predict categorical class labels Build a model for a set of classes/concepts Classify loan applications (approve/decline) Prediction: model

More information

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012 CS-62 Theory Gems September 27, 22 Lecture 4 Lecturer: Aleksander Mądry Scribes: Alhussein Fawzi Learning Non-Linear Classifiers In the previous lectures, we have focused on finding linear classifiers,

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

More information

High Wind and Energy Specific Models for Global. Production Forecast

High Wind and Energy Specific Models for Global. Production Forecast High Wind and Energy Specific Models for Global Production Forecast Carlos Alaíz, Álvaro Barbero, Ángela Fernández, José R. Dorronsoro Dpto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento

More information

A Metric Approach to Building Decision Trees based on Goodman-Kruskal Association Index

A Metric Approach to Building Decision Trees based on Goodman-Kruskal Association Index A Metric Approach to Building Decision Trees based on Goodman-Kruskal Association Index Dan A. Simovici and Szymon Jaroszewicz University of Massachusetts at Boston, Department of Computer Science, Boston,

More information

Machine Learning & Data Mining

Machine Learning & Data Mining Group M L D Machine Learning M & Data Mining Chapter 7 Decision Trees Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University Top 10 Algorithm in DM #1: C4.5 #2: K-Means #3: SVM

More information

Ensembles of Classifiers.

Ensembles of Classifiers. Ensembles of Classifiers www.biostat.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts ensemble bootstrap sample bagging boosting random forests error correcting

More information

Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles. Léon Bottou COS 424 4/8/2010 Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

Learning Ensembles. 293S T. Yang. UCSB, 2017.

Learning Ensembles. 293S T. Yang. UCSB, 2017. Learning Ensembles 293S T. Yang. UCSB, 2017. Outlines Learning Assembles Random Forest Adaboost Training data: Restaurant example Examples described by attribute values (Boolean, discrete, continuous)

More information

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr Study on Classification Methods Based on Three Different Learning Criteria Jae Kyu Suhr Contents Introduction Three learning criteria LSE, TER, AUC Methods based on three learning criteria LSE:, ELM TER:

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Midterm: CS 6375 Spring 2015 Solutions

Midterm: CS 6375 Spring 2015 Solutions Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an

More information

B555 - Machine Learning - Homework 4. Enrique Areyan April 28, 2015

B555 - Machine Learning - Homework 4. Enrique Areyan April 28, 2015 - Machine Learning - Homework Enrique Areyan April 8, 01 Problem 1: Give decision trees to represent the following oolean functions a) A b) A C c) Ā d) A C D e) A C D where Ā is a negation of A and is

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization

More information