Combining Heterogeneous Sets of Classifiers: Theoretical and Experimental Comparison of Methods


Dennis Bahler and Laura Navarro
Department of Computer Science
North Carolina State University
Raleigh, NC

Abstract

In recent years, the combination of classifiers has been proposed as a method to improve the accuracy achieved in isolation by a single classifier. We are interested in ensemble methods that allow the combination of heterogeneous sets of classifiers, which are classifiers built using differing learning paradigms. We focus on theoretical and experimental comparison of five such combination methods: majority vote, a method based on Bayes rule, a method based on Dempster-Shafer evidence combination, behavior-knowledge space, and logistic regression. We develop an upper bound on the accuracy that can be obtained by any of the five methods of combination, and show that this estimate can be used to determine whether an ensemble may improve the performance of its members. We then report a series of experiments using standard data sets and learning methods, and compare experimental results to theoretical expectations.

Keywords: classifier ensembles, voting, Bayesian combination, Dempster-Shafer, behavior-knowledge space, logistic regression

1 Introduction

The combination of the decisions of several classifiers has been proposed as a means of improving the accuracy achieved by any one of them. The reasons for combining the outputs of multiple classifiers are compelling: different classifiers may implicitly represent different useful aspects of a problem, or of the input data, while no single classifier represents all useful aspects.

In the context of pattern recognition, the idea of combining the decisions of several classifiers has been well explored. A wide variety of methods have been used to combine the results of several neural networks [6, 7, 13, 20, 23, 26], decision trees [2, 12], sets of rules [3] and other models [10, 11, 28, 31, 29, 30]. Moreover, the idea of combining decisions is not exclusive to the pattern recognition community. For example, Bates et al. [5] describe the combination of forecasts in operations research, and French [8] presents an analysis of several methods used to form group consensus statistical distributions. Also in statistics, Uebersax [27] describes how to combine the decisions of several experts on the appropriateness of indications for a specific medical procedure.

In contrast to approaches which combine models derived from multiple versions of the same learning method, the specific focus of this paper is on ensemble methods that are able to combine the decisions of multiple classifiers of different types, so-called heterogeneous sets of classifiers. Different classifiers typically express their opinions in different ways. For example, an expert may report probabilities, distances between the concept described by an example and already defined concepts, or simply a label representing the predicted class of the example.

We formalize our problem as follows. Assume a pattern space $U$ partitioned into $M$ mutually exclusive sets, $U = C_1 \cup C_2 \cup \ldots \cup C_M$, where each $C_i$, $i \in \{1,\ldots,M\}$, represents a set of patterns called a class. Each expert or classifier (denoted by $e$) assigns to a sample $x \in U$ an index $j \in \{1,\ldots,M+1\}$: if $j \le M$, the index represents $x$ as being from class $C_j$, while $j = M+1$ represents that the expert has no idea which class $x$ belongs to, i.e. $x$ is rejected by $e$. The decision of expert $e$ is denoted by $e(x) = j$. Based on this definition of the decision of an expert, our problem can be defined as the combination of $D$ different classifiers $e_k$, $k = 1,\ldots,D$, each of which assigns $x$ to a label $j_k$, denoted $e_k(x) = j_k$, into an integrated classifier $E$, which gives $x$ one definitive label $j$.

The performance of a classifier can be described by its recognition, substitution, and rejection rates. The recognition rate of a classifier $e_k$, denoted $\varepsilon_r^{(k)}$, measures the probability that its output given sample $x$ corresponds to the actual class of $x$:

$$\varepsilon_r^{(k)} = \sum_{i=1}^{M} P(x \in C_i,\ e_k(x) = i)$$

The substitution rate, on the other hand, is related to the probability that $e_k$ makes the wrong decision, i.e. the output is label $j$ for sample $x$ when $x$ actually belongs to class $C_i$, $j \ne i$:

$$\varepsilon_s^{(k)} = \sum_{i=1}^{M} \sum_{j \ne i} P(x \in C_i,\ e_k(x) = j)$$

If a classifier does not have enough evidence to assign $x$ to any class ($e_k(x) = M+1$), then $\varepsilon_r^{(k)} + \varepsilon_s^{(k)} < 1$. In fact, $1 - \varepsilon_r^{(k)} - \varepsilon_s^{(k)}$ measures how often $e_k$ rejects samples.

To investigate systematically various approaches to combination, we performed the following set of experiments. We began with four data sets from the UCI repository [18]: Tic-Tac-Toe Endgame, Wisconsin Breast Cancer, Pima Indians Diabetes, and Credit Screening. These data sets were each used to train a heterogeneous set of three classifiers using standard techniques: a decision tree, a Bayesian belief network, and a backpropagation neural network. To obtain an integrated classifier, the results of the classifiers for each data set were then combined into an ensemble using five methods: majority vote, a method based on Bayes rule, a method based on Dempster-Shafer evidence combination, behavior-knowledge space, and logistic regression. Next, by examining conditions under which the combination of classifiers does not improve the accuracy obtained by a single expert, we developed a method to predict the theoretical maximum recognition rate that can be achieved by each of the ensemble methods. Finally, we compared the experimentally observed changes in accuracy of the ensemble models with the theoretical bounds we developed earlier.

The remainder of this paper is organized as follows. In Section 2, we briefly describe the five methods that we used to combine multiple classifiers. In Section 3, we describe conditions for improvement in the recognition rate of a majority voting ensemble over the single classifiers and give estimators of the accuracy of a vote-based ensemble.

In the same section, we analyze the behavior of the remaining ensemble methods and define upper bounds on the accuracy that may be obtained by any of these combination methods. In Section 4, we present the results of experiments that show the applicability of this analysis. We offer some conclusions in Section 5.

2 Methods of Combining Multiple Classifiers

Combination methods arising in statistics [8] and operations research [5] typically combine not only classifications but also measures of confidence in those opinions. Confidence measures may be combined using statistical methods for continuous data, such as averaging [6, 16], opinion pools [8] or Bayes theory [8]. These combination techniques require that the measures be homogeneous, that is, of the same kind for all the experts in the group; for example, all of them represent probabilities, or distances, or they need to be converted to equivalent scales.

If the experts are not able to supply the same measure of confidence, or not able to supply any measure at all, we are confronted with a different problem. This is the case of interest to us, when several heterogeneous classifiers are to be combined and the only common information available from all the experts is their opinion. Methods have been developed to deal with this problem, and they are described in this section.

2.1 Voting

Generally speaking, the voting principle is just what we know as majority voting. Several variations of this idea have been proposed (a sketch of the plain majority rule is given after this list):

- The combined classifier decides that an input pattern $x$ comes from class $C_j$ if and only if all the classifiers decide that $x$ comes from class $C_j$; otherwise it rejects $x$.

- The combined classifier decides that an observation $x$ comes from class $C_j$ if some classifiers support that $x$ belongs to $C_j$, and no other classifier supports that $x$ belongs to any other class (i.e. they reject $x$); otherwise it rejects $x$.

- The majority rule, where the combined classifier decides an observation $x$ belongs to class $C_j$ if more than half of the classifiers support that $x$ belongs to $C_j$. A modification of this rule is to require a different proportion of classifiers to agree instead of half of them.

- The combined classifier decides that observation $x$ belongs to class $C_j$ if the number of classifiers that support it is considerably bigger than the number of classifiers that support any other class.
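To make the plain majority rule concrete, the sketch below counts the hard labels produced by the individual classifiers. It is our own minimal illustration, not code from the paper, and the names majority_vote, REJECT and min_fraction are ours.

```python
from collections import Counter

REJECT = "reject"  # stands in for the extra label M+1 used in the paper

def majority_vote(labels, min_fraction=0.5):
    """Combine hard labels by majority rule.

    labels: list of class labels, one per classifier (a label may be REJECT).
    min_fraction: fraction of all classifiers that must agree; 0.5 gives the
    usual majority rule, higher values give the stricter variants above.
    Returns the winning class, or REJECT if no class reaches the threshold.
    """
    votes = Counter(l for l in labels if l != REJECT)
    if not votes:
        return REJECT
    best_class, best_count = votes.most_common(1)[0]
    if best_count > min_fraction * len(labels):
        return best_class
    return REJECT

# Example: three classifiers, two of which agree.
print(majority_vote(["up", "down", "up"]))    # -> "up"
print(majority_vote(["up", "down", REJECT]))  # -> REJECT (no strict majority)
```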

Combining classifiers with this method is simple; it does not require any previous knowledge of the behavior of the classifiers, nor does it require any complex methodology to decide. It only counts the number of classifiers that agree in their decision and accordingly decides the class to which the input pattern belongs. This simplicity has a drawback, however: the decisions of all the classifiers carry equal weight, even when some of the classifiers are much more accurate than others.

2.2 Bayesian Ensemble Methods

Voting methods are based solely on the output label computed by each classifier. No expertise or accuracy is considered. In these methods the decision of each classifier is treated as one vote, but what happens if one of the classifiers is much more accurate than any other? Should its accuracy not be considered in the combination? Suppose that we ask two experts whether a stock value will increase in the next month. We are likely to weigh their opinions differently depending on how accurate they have been in the past. To address this problem we can establish weights proportional to each expert's accuracy, so that each classifier's output is considered according to its past performance. Several methods of establishing these weights have been developed, and some of them are described in the following sections.

One way to represent an expert's behavior is through a confusion matrix, which summarizes the accuracy of each classifier as follows:

$$CM_k = \begin{pmatrix}
n^{(k)}_{11} & n^{(k)}_{12} & \cdots & n^{(k)}_{1M} & n^{(k)}_{1,M+1} \\
n^{(k)}_{21} & n^{(k)}_{22} & \cdots & n^{(k)}_{2M} & n^{(k)}_{2,M+1} \\
\vdots & \vdots & & \vdots & \vdots \\
n^{(k)}_{M1} & n^{(k)}_{M2} & \cdots & n^{(k)}_{MM} & n^{(k)}_{M,M+1}
\end{pmatrix}, \qquad k = 1,\ldots,D$$

where each row $i$ corresponds to the samples of class $C_i$, and each column $j$ corresponds to the samples that have been assigned label $j$ by the expert $e_k$ ($e_k(x) = j$). This means that each $n^{(k)}_{ij}$ is the number of samples of class $C_i$ that have been assigned label $j$ by expert $e_k$.
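As a small illustration of how such a confusion matrix can be tallied from held-out data (our sketch, with hypothetical function and variable names), each row indexes the true class and each column the label assigned by the classifier, with the last column collecting rejections:

```python
import numpy as np

def confusion_matrix(true_classes, decisions, n_classes):
    """Tally the M x (M+1) confusion matrix CM_k of one classifier.

    true_classes: iterable of true class indices in {0, ..., M-1}.
    decisions:    iterable of the classifier's labels in {0, ..., M},
                  where label M means the sample was rejected.
    Entry [i, j] counts samples of class i that received label j.
    """
    cm = np.zeros((n_classes, n_classes + 1), dtype=int)
    for i, j in zip(true_classes, decisions):
        cm[i, j] += 1
    return cm

# Two-class toy example (0 = "up", 1 = "down", label 2 = rejected).
cm1 = confusion_matrix(true_classes=[0, 0, 1, 1, 0],
                       decisions=[0, 0, 1, 0, 2],
                       n_classes=2)
print(cm1)  # rows: true class, columns: assigned label (last column = rejections)
```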

 Expert                 Real class
 e_1        e_2         Up    Down
 Up         Up          79    10
 Up         Down        35     5
 Up         Unknown      5     1
 Down       Up           4    30
 Down       Down         3    35
 Down       Unknown      1     8
 Unknown    Up          13     0
 Unknown    Down         1     8
 Unknown    Unknown      2     0

Table 1: Summary of the experts' predictions in the stock example.

Consider again the example of the experts in the stock market. Table 1 shows all possible combinations of decisions made by the two experts in the past, and, for each of these combinations, indicates how many stock values actually increased in the next month, and how many decreased. Given this information, the confusion matrices for the two experts (rows: actual class Up, Down; columns: predicted Up, Down, Unknown) are:

$$CM_1 = \begin{pmatrix} 119 & 8 & 16 \\ 16 & 73 & 8 \end{pmatrix} \qquad
CM_2 = \begin{pmatrix} 96 & 39 & 8 \\ 40 & 48 & 9 \end{pmatrix}$$

The number of samples of class $C_i$ can be obtained by

$$\sum_{j=1}^{M+1} n^{(k)}_{ij} = n^{(k)}_{i+}$$

The number of samples that have been assigned label $j$ is

$$\sum_{i=1}^{M} n^{(k)}_{ij} = n^{(k)}_{+j}$$

and the total number of samples is

$$N = \sum_{i=1}^{M} \sum_{j=1}^{M+1} n^{(k)}_{ij}$$

The fact that $x$ comes from class $C_i$ given that classifier $e_k$ output label $j$ for $x$ (event $e_k(x) = j$) has uncertainty. Through the confusion matrix $CM_k$, such uncertainty can be approximated by the conditional probabilities that $x$ belongs to class $C_i$, $i = 1,\ldots,M$, given that $e_k(x) = j$:

$$P(x \in C_i \mid e_k(x) = j) = \frac{n^{(k)}_{ij}}{n^{(k)}_{+j}} = \frac{n^{(k)}_{ij}}{\sum_{l=1}^{M} n^{(k)}_{lj}} \qquad i = 1,\ldots,M \qquad (1)$$

The confusion matrix $CM_k$ can be regarded as the prior knowledge of expert $e_k$. Based on the occurrence of the event $e_k(x) = j$, the expert expresses its beliefs, with uncertainty, on each of the $M$ mutually exclusive propositions $x \in C_i$, $i \in \{1,\ldots,M\}$, by a real value called a belief. The belief value given the decision of a classifier $e_k$ with respect to an example $x$ can be defined as

$$bel(x \in C_i \mid e_k(x)) = P(x \in C_i \mid e_k(x) = j) \qquad i = 1,\ldots,M$$

When $D$ different experts $e_1,\ldots,e_D$ with $D$ confusion matrices $CM_1,\ldots,CM_D$ are used over the same observation $x$, $D$ events $e_i(x) = j_i$, $i = 1,\ldots,D$ will occur. Each event supplies its own set of $bel(x \in C_i \mid e_k(x))$, $i = 1,\ldots,M$. We now need a way to integrate these beliefs into a combined value, given by

$$\begin{aligned}
bel(C_i) &= bel(x \in C_i \mid e_1(x), e_2(x), \ldots, e_D(x)) \\
&= P(x \in C_i \mid e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D) \\
&= \frac{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D \mid x \in C_i)\, P(x \in C_i)}{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D)} \qquad i = 1,\ldots,M \qquad (2)
\end{aligned}$$

If the classifiers perform independently of each other, then the events $e_i(x) = j_i$, $i = 1,\ldots,D$ will be independent of each other and we have:

9 bel(c i ) = = = Dk=1 P (e k (x) =j k x C i ) Dk=1 P(x C i ) P (e k (x) =j k ) Dk=1 P (e k (x) =j k,x C i ) Dk=1 P (e k (x) =j k ) D k=1 P (x C i ) P (x C i) Dk=1 P (x C i e k (x) =j k ) P(x C P(x C i ) D i ) where P (x C i e k = j k ) can be estimated by (1). A good estimator of P (x C i ) should be the posterior probabilities P (x C i x), but they are not available, so we estimate the value of the bel(c i )with: bel(c i ) Dk=1 P (x C i e k (x) =j k ) Ml=1 Dk=1 P (x C l e k (x) =j k ) This substitution will ensure that M i=1 bel(c i ) = 1, condition required since the events x C i,i=1..m are mutually exclusive and exhaustive. As a result of this ensemble method, from the opinions of the stock price experts, we obtain a belief about the gain or loss in stock value depending on the combination of decisions of the two classifiers. Then the belief that a stock value x will gain value in the next month given that e 1 (x) =Up and e 2 (x) =Down is obtained by: P 1 (Up) = P(x Up e 1 (x)=up) P 1 (Down) = P(x Down e 1 (x) =Up) P 2 (Up) = P(x Up e 2 (x)=down) P 2 (Down) = P(x Down e 2 (x) =Down) bel(up) = P 1 (Up)P 2 (Up) P 1 (Up)P 2 (Up)+P 1 (Down)P 2 (Down) = =0.86 The beliefs for all possible experts decisions in the example are shown in Table 2. Finally, depending on these values of bel(.), x can be classified into class C i if: 9

Finally, depending on these values of $bel(\cdot)$, $x$ is classified according to the decision rule

$$E(x) = \begin{cases} i & \text{if } bel(C_i) = \max_{j=1,\ldots,M} bel(C_j) \text{ and } bel(C_i) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold.

 e_1(x)     e_2(x)      E(x)
 Up         Up          Up
 Up         Down        Up
 Up         Unknown     Up
 Down       Up          Down
 Down       Down        Down
 Down       Unknown     Down
 Unknown    Up          Up
 Unknown    Down        Up
 Unknown    Unknown     Up

Table 2: Bayes combination example: the ensemble decision E(x) obtained from the estimated conditional probabilities $P(x \in C_i \mid e_k(x) = j)$ and the combined beliefs, for each combination of the experts' opinions.

2.3 Behavior-Knowledge Space Methods

One of the strongest conditions for the applicability of Bayes rule to combining the decisions of several experts is the requirement that they perform independently. The Behavior-Knowledge Space method described in [11, 28], by contrast, makes no assumption about the data to be combined.

It determines the belief in a proposition $x \in C_i$ based on the combination of the items of evidence $e_k(x) = j_k$, $k = 1,\ldots,D$, according to

$$bel(C_i) = \frac{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D,\, x \in C_i)}{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D)}$$

corresponding to Equation (2) used in the Bayesian approach.

This definition can be easily implemented through a Behavior-Knowledge Space (BKS). A BKS is a D-dimensional space, where each dimension corresponds to the decision of one classifier. Each classifier has $M+1$ possible decision values: the $M$ classes plus the rejection of the given sample by the classifier (class $M+1$). The intersection of the decisions of the individual classifiers occupies one point of the BKS. Each such point, denoted $BKS(j_1, j_2, \ldots, j_D)$, $j_i = 1,\ldots,M+1$, $i = 1,\ldots,D$, contains three kinds of data:

- $n_{j_1,\ldots,j_D}(i)$: the total number of patterns $x$ such that $e_1(x) = j_1, \ldots, e_D(x) = j_D$ and $x \in C_i$, $i = 1,\ldots,M$.

- $T_{j_1,\ldots,j_D}$: the total number of patterns $x$ such that $e_1(x) = j_1, \ldots, e_D(x) = j_D$,

$$T_{j_1,\ldots,j_D} = \sum_{i=1}^{M} n_{j_1,\ldots,j_D}(i)$$

- $R_{j_1,\ldots,j_D}$: the best representative class of $BKS(j_1, j_2, \ldots, j_D)$,

$$n_{j_1,\ldots,j_D}(R_{j_1,\ldots,j_D}) = \max_{i=1,\ldots,M} n_{j_1,\ldots,j_D}(i) \qquad (3)$$

Continuing with our example of the stock market, the BKS is shown in Table 3. Once the BKS is obtained, the belief that $x$ belongs to class $C_i$, $bel(C_i)$, $i \in \{1,\ldots,M\}$, can be calculated by

$$bel(C_i) = \frac{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D,\, x \in C_i)}{P(e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D)} = \frac{n_{j_1,\ldots,j_D}(i)}{T_{j_1,\ldots,j_D}}$$
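Before stating the decision rule, here is a small sketch (ours, with hypothetical names build_bks and bks_classify) of how a BKS can be stored as a dictionary keyed by the tuple of classifier decisions and then used to pick the best representative class:

```python
from collections import defaultdict
import numpy as np

def build_bks(decision_tuples, true_classes, n_classes):
    """Build a Behavior-Knowledge Space: for every observed tuple of classifier
    decisions, count how many training samples of each class produced it."""
    bks = defaultdict(lambda: np.zeros(n_classes, dtype=int))
    for decisions, c in zip(decision_tuples, true_classes):
        bks[tuple(decisions)][c] += 1
    return bks

def bks_classify(bks, decisions, n_classes, alpha=0.5):
    """Return the best representative class of the BKS cell, or n_classes
    (rejection) if the cell is empty or its belief falls below alpha."""
    counts = bks.get(tuple(decisions))
    if counts is None or counts.sum() == 0:
        return n_classes                      # unseen combination: reject
    best = int(np.argmax(counts))             # R_{j_1,...,j_D}
    belief = counts[best] / counts.sum()      # n(best) / T
    return best if belief >= alpha else n_classes

# Toy usage: two classifiers, two classes (label 2 = rejection).
bks = build_bks([(0, 0), (0, 0), (0, 1), (1, 1)], [0, 0, 1, 1], n_classes=2)
print(bks_classify(bks, (0, 0), n_classes=2))  # -> 0
print(bks_classify(bks, (1, 0), n_classes=2))  # -> 2 (never seen, rejected)
```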

                         e_1 = Up                e_1 = Down             e_1 = Unknown
                  n[U]  n[D]   T    R     n[U]  n[D]   T    R     n[U]  n[D]   T    R
 e_2 = Up          79    10    89   U       4    30    34   D      13     0    13   U
 e_2 = Down        35     5    40   U       3    35    38   D       1     8     9   D
 e_2 = Unknown      5     1     6   U       1     8     9   D       2     0     2   U

 (U = Up, D = Down)

Table 3: Behavior-Knowledge Space example.

Similarly to the combination using Bayes rule, $x$ is classified into class $C_i$ if $bel(C_i) = \max_{j=1,\ldots,M} bel(C_j) = bel(C_{R_{j_1,\ldots,j_D}})$, given Equation (3). Finally we can classify $x$ into a class according to the decision rule

$$E(x) = \begin{cases} R_{j_1,\ldots,j_D} & \text{if } T_{j_1,\ldots,j_D} > 0 \text{ and } bel(C_{R_{j_1,\ldots,j_D}}) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold.

2.4 Dempster-Shafer Ensemble Methods

The use of Bayes rule to combine the decisions of several classifiers may be inappropriate in some cases. Bayes rule requires the belief measures to behave as probabilities [25], and for the decisions of experts this requirement is often impossible to satisfy. For example, according to Bayes rule, given $\varepsilon_r^{(k)}$ and $\varepsilon_s^{(k)}$, the axioms of probability require $\varepsilon_r^{(k)} + \varepsilon_s^{(k)} = 1$, but this is not the case when $e_k$ rejects a sample $x$. In this case we cannot say that $e_k$ was correct in its decision, but neither can we say that it assigned the wrong label to $x$, since it did not assign any label. Here it would be reasonable to say that both $bel(x \in C_i)$ and $bel(x \notin C_i)$ have value 0.

The Dempster-Shafer calculus is a system for manipulating degrees of belief that is more general than the Bayesian approach and does not require the assumption that $\varepsilon_r^{(k)} + \varepsilon_s^{(k)} = 1$. [31] introduces a method to combine multiple classifiers by adapting Dempster-Shafer evidence theory.

In using Dempster-Shafer theory to combine the decisions of several classifiers, the exclusive and exhaustive possibilities that form the frame of discernment $\theta$ are the propositions $x \in C_i$, $i = 1,\ldots,M$, denoting that an input sample $x$ comes from class $C_i$. When $D$ classifiers $e_1, e_2, \ldots, e_D$ analyze $x$, they produce $D$ items of evidence $e_k(x) = j_k$, where $j_k \in \{1,\ldots,M+1\}$.

Considering $\varepsilon_r^{(k)}$ and $\varepsilon_s^{(k)}$, the recognition and substitution rates of classifier $e_k$, and $e_k(x) = j_k$ for each $k = 1,\ldots,D$, one can have uncertain beliefs that $x \in C_{j_k}$ is true with degree $\varepsilon_r^{(k)}$ and is not true with degree $\varepsilon_s^{(k)}$ when $j_k \in \{1,\ldots,M\}$. If $j_k = M+1$ (i.e. $x$ is rejected by $e_k$), no information about the $M$ propositions $x \in C_i$, $i \in \{1,\ldots,M\}$, is given. In this case full weight of support should be given to the entire frame of discernment $\theta$.

Now we can define a basic probability assignment (BPA) function $m_k$ for the evidence $e_k(x) = j_k$ in the following way:

- If $j_k = M+1$, $x$ was rejected by $e_k$, so $m_k(x \in C_i) = 0$ and $m_k(x \notin C_i) = 0$ for all $i \in \{1,\ldots,M\}$, and $m_k(\theta) = 1$, since $e_k$ says nothing about any of the propositions.

- If $j_k \in \{1,\ldots,M\}$, then there are two focal elements, $x \in C_{j_k}$ and $x \notin C_{j_k}$ (that is, $x \in \theta \setminus \{C_{j_k}\}$), with $m_k(x \in C_{j_k}) = \varepsilon_r^{(k)}$ and $m_k(x \notin C_{j_k}) = \varepsilon_s^{(k)}$. Since $e_k$ does not say anything about any other class, $m_k(\theta) = 1 - \varepsilon_r^{(k)} - \varepsilon_s^{(k)}$.

Once we obtain the $D$ BPAs $m_k$, $k = 1,\ldots,D$, Dempster's rule of combination can be used to obtain a combined $m = m_1 \oplus m_2 \oplus \ldots \oplus m_D$. We would then have the basis to calculate $Bel(C_i)$ for all $i \in \{1,\ldots,M\}$ based on the $D$ BPAs.

First, the items of evidence $e_k = M+1$ are discarded, since they have no influence on the result of the combination, leaving $D' \le D$ classifiers to be considered. We may also rule out three special cases:

- $D' = 0$. In this case all the classifiers rejected $x$, so the final decision is to reject $x$ too.

- There is one classifier $e_k$ which has a recognition rate of $\varepsilon_r^{(k)} = 1$. In this case it classifies any input correctly, so the combination is not necessary.

- There is one classifier $e_k$ which has a substitution rate of $\varepsilon_s^{(k)} = 1$. In this case it always makes the wrong decision, so it will not be used. However, if there are only two classes, the complement of the decision of $e_k$ has recognition rate $\varepsilon_r^{(k)} = 1$.

We can then concentrate on $D'$ items of evidence $e_k = j_k$, $k = 1,\ldots,D'$, with $0 < \varepsilon_r^{(k)} < 1$ and $0 \le \varepsilon_s^{(k)} < 1$.

Taking the information from the stock market example, the resulting items of evidence and their BPAs for each possible output of the experts are shown in Table 4.

Table 4: BPAs ($m_k(\mathrm{Up})$, $m_k(\neg\mathrm{Up})$, $m_k(\theta)$ for $k = 1, 2$) of each possible outcome of the experts' decisions in the stock example, given that $\varepsilon_r^{(1)} = 0.8$, $\varepsilon_s^{(1)} = 0.1$, $\varepsilon_r^{(2)} = 0.6$ and $\varepsilon_s^{(2)} = 0.33$.

To obtain the combined BPA we need to calculate $m = m_1 \oplus m_2 \oplus \ldots \oplus m_D$, but the computational cost of obtaining $m$ increases exponentially with $M$, the number of classes, since we have to consider all possible subsets of $\theta$ and $|\theta| = M$. However, in this particular problem the BPA of each item of evidence has only two focal elements, so the cost can be reduced by using the following method, proposed in [31], consisting of two main steps.

In the first step, the items of evidence are collected into $D_1$ groups, $E_1, E_2, \ldots, E_{D_1}$, with the classifiers supporting the same proposition placed in the same group, in such a way that $e_k(x) = j_k$ belongs to $E_{k'}$ if $j_k = j_{k'}$. The BPAs $m_k$ in each group are then combined recursively as follows:

$$m^1(x \in C_{j_k}) = m_{k_1}(x \in C_{j_k})$$

$$m^1(x \notin C_{j_k}) = m_{k_1}(x \notin C_{j_k})$$

Then, for $r = 2,\ldots,p$, where $p$ is the number of items of evidence in the group,

$$k^r = \frac{1}{1 - m^{r-1}(x \in C_{j_k})\, m_{k_r}(x \notin C_{j_k}) - m^{r-1}(x \notin C_{j_k})\, m_{k_r}(x \in C_{j_k})}$$

$$m^r(\theta) = k^r\, m^{r-1}(\theta)\, m_{k_r}(\theta)$$

$$m^r(x \in C_{j_k}) = k^r \left( m^{r-1}(x \in C_{j_k})\, [\, m_{k_r}(\theta) + m_{k_r}(x \in C_{j_k})\,] + m_{k_r}(x \in C_{j_k})\, m^{r-1}(\theta) \right)$$

$$m^r(x \notin C_{j_k}) = k^r \left( m^{r-1}(x \notin C_{j_k})\, [\, m_{k_r}(\theta) + m_{k_r}(x \notin C_{j_k})\,] + m_{k_r}(x \notin C_{j_k})\, m^{r-1}(\theta) \right)$$

As a result we obtain $D_1$ combined classifiers $E_1, \ldots, E_{D_1}$ with their corresponding items of evidence $E_k(x) = j_k$, $k = 1,\ldots,D_1$ ($j_1 \ne j_2 \ne \ldots \ne j_{D_1}$), with recognition rate $\varepsilon_r^{(k)} = m^k(x \in C_{j_k})$ and substitution rate $\varepsilon_s^{(k)} = m^k(x \notin C_{j_k})$ (see Table 5 for an example).

Table 5: Combined BPAs: the grouped items of evidence $E_1$, $E_2$ and their recognition and substitution rates for each pair of expert decisions in the stock example.

The second step combines these new BPAs $m^k$, $k = 1,\ldots,D_1$, into a final combined BPA $m = m^1 \oplus m^2 \oplus \ldots \oplus m^{D_1}$ and then calculates the corresponding $Bel(C_i)$ for all $i \in \{1,\ldots,M\}$. Since each of these new BPAs $m^k$, $k = 1,\ldots,D_1$, supports either one proposition ($x \in C_i$, $i = 1,\ldots,M$) or its negation ($x \notin C_i$, $i = 1,\ldots,M$), we may use the formulae developed in [4] to obtain the required beliefs.

Considering that in our case $m_k(\theta) = 1$ for $k = D_1+1,\ldots,M$, from formulae (6)-(8) in [4] we obtain

$$A = \sum_{i=1}^{D_1} \frac{m^i(x \in C_i)}{1 - m^i(x \in C_i)}, \qquad
B = \prod_{i=1}^{D_1} \left(1 - m^i(x \in C_i)\right), \qquad
C = \prod_{i=1}^{D_1} m^i(x \notin C_i)$$

$$k^{-1} = \begin{cases} (1+A)\,B - C & \text{if } D_1 = M \\ (1+A)\,B & \text{if } D_1 < M \end{cases}$$

For $Bel(C_{j_k})$ four different cases may arise:

$$Bel(C_{j_k}) = \begin{cases}
k \left( \dfrac{B\, m^k(x \in C_{j_k})}{1 - m^k(x \in C_{j_k})} + \dfrac{C\, m^k(\theta)}{m^k(x \notin C_{j_k})} \right) & \text{if } D_1 = M \\[2ex]
k\, \dfrac{B\, m^k(x \in C_{j_k})}{1 - m^k(x \in C_{j_k})} & \text{if } D_1 < M \text{ and } j_k \le D_1 \\[2ex]
k\, C & \text{if } D_1 = M-1 \text{ and } j_k = M \\[1ex]
0 & \text{otherwise}
\end{cases}$$

Table 6 shows the beliefs obtained for each possible decision outcome of the two experts in the stock example. Finally, depending on these values of $Bel(\cdot)$, $x$ is classified according to the decision rule

$$E(x) = \begin{cases} i & \text{if } Bel(C_i) = \max_{j=1,\ldots,M} Bel(C_j) \text{ and } Bel(C_i) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold.
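The sketch below (ours, not the authors' code) implements Dempster's rule for the two-focal-element BPAs used in this section, i.e. the recursive grouping step described above. The tuple layout (support, against, theta) and the function names are our own conventions; the rates in the example are the ones quoted for the stock experts in Table 4.

```python
def combine_pair(m1, m2):
    """Dempster's rule for two BPAs over the same proposition A, each given as
    a tuple (support, against, theta) = (m(A), m(not A), m(theta))."""
    s1, a1, t1 = m1
    s2, a2, t2 = m2
    conflict = s1 * a2 + a1 * s2
    k = 1.0 / (1.0 - conflict)          # normalization constant
    support = k * (s1 * (t2 + s2) + s2 * t1)
    against = k * (a1 * (t2 + a2) + a2 * t1)
    theta = k * t1 * t2
    return (support, against, theta)

def combine_group(bpas):
    """Recursively combine all BPAs that support the same class (one group)."""
    combined = bpas[0]
    for m in bpas[1:]:
        combined = combine_pair(combined, m)
    return combined

# Two classifiers that both say "Up", with the rates used in the stock example:
# m_k = (recognition rate, substitution rate, remainder assigned to theta).
m1 = (0.8, 0.1, 0.1)
m2 = (0.6, 0.33, 0.07)
print(combine_group([m1, m2]))  # combined support for "Up", against it, and theta
```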

 e_1(x)     e_2(x)      E(x)
 Up         Up          Up
 Up         Down        Up
 Up         Unknown     Up
 Down       Up          Down
 Down       Down        Down
 Down       Unknown     Down
 Unknown    Up          Up
 Unknown    Down        Down
 Unknown    Unknown     Unknown  (Bel(Up) = Bel(Down) = 0)

Table 6: Combined beliefs and the resulting ensemble decision for each pair of expert decisions in the stock example.

2.5 Logistic Regression Ensembles

Another method that assigns weights to each classifier is combination using logistic regression [10].

Consider $Y_i$, a binary variable that has value 1 when $x \in C_i$ and 0 otherwise. The goal of recognition is to predict the value of $Y_i$ for all $i = 1,\ldots,M$. Since $Y_i$ is a binary value, the combination problem may be reformulated in the context of logistic regression analysis, where $Y_i$ depends on the value of the explanatory variable $z_i = (z_{i1}, z_{i2}, \ldots, z_{iD})$, where each $z_{ik} = 1$ if $e_k(x) = i$ and zero otherwise. Denoting $P(x \in C_i \mid z_i)$ by $\pi(z_i)$ and using the logistic regression function,

$$\pi(z_i) = \frac{\exp(\alpha_i + \beta_{i1} z_{i1} + \ldots + \beta_{iD} z_{iD})}{1 + \exp(\alpha_i + \beta_{i1} z_{i1} + \ldots + \beta_{iD} z_{iD})}$$

and

$$\log\left( \frac{\pi(z_i)}{1 - \pi(z_i)} \right) = \alpha_i + \beta_{i1} z_{i1} + \ldots + \beta_{iD} z_{iD}$$

where $\alpha_i, \beta_{i1}, \ldots, \beta_{iD}$ are constant parameters and $\log\left( \frac{\pi(z_i)}{1 - \pi(z_i)} \right)$ is the logit. Since the logit is linearly dependent on $z_i$, the values of the parameters $\alpha_i, \beta_{i1}, \ldots, \beta_{iD}$ can be estimated through weighted least squares [1].

To determine a good estimator for the true logit [1], assume there are $n_{ij}$ observations at the $j$th setting of $z_i$, and $y_{ij}$ is the number of examples with $Y_i = 1$ obtained in this setting. The $j$th sample logit is then $\log[y_{ij}/(n_{ij} - y_{ij})]$.

Since this value is not defined when $y_{ij}$ or $n_{ij} - y_{ij}$ are zero, the logit is adjusted to the value

$$\log\left( \frac{y_{ij} + \frac{1}{2}}{n_{ij} - y_{ij} + \frac{1}{2}} \right)$$

Considering the stock value example described earlier, the logit values for each class are shown in Table 7.

Table 7: Logit example: the sample logits for each class at each setting $(z_{i1}, z_{i2})$ of the explanatory variables in the stock example.

For each test pattern $x$, the value of the $M$ vectors $z_i$ is obtained from the events $e_k(x) = j_k$, and then $x$ is classified into a class according to the decision rule

$$E(x) = \begin{cases} i & \text{if } \mathrm{logit}(z_i) = \max_{j=1,\ldots,M} \mathrm{logit}(z_j) \text{ and } \pi(z_i) \ge \alpha \\ M+1 & \text{otherwise} \end{cases}$$

where $0 \le \alpha \le 1$ is a threshold. When the number of classes is large or there is not enough data available, a simpler model can be constructed by using only one response variable $Y$ for all the classes.
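As an illustration of a logistic-regression ensemble over the indicator variables $z_{ik}$, here is a sketch of ours; it fits by maximum likelihood with scikit-learn's LogisticRegression rather than the weighted least-squares fit on adjusted logits described in the text, and all function names and the toy data are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lr_ensemble(decisions, true_classes, n_classes):
    """Fit one logistic model per class i on the indicator features
    z_ik = 1 if classifier k voted for class i, else 0."""
    decisions = np.asarray(decisions)            # shape (n_samples, D)
    true_classes = np.asarray(true_classes)
    models = []
    for i in range(n_classes):
        z = (decisions == i).astype(float)       # explanatory variables z_i
        y = (true_classes == i).astype(int)      # response Y_i
        models.append(LogisticRegression().fit(z, y))
    return models

def lr_classify(models, decisions, alpha=0.5):
    """Pick the class whose model gives the highest probability pi(z_i),
    rejecting (returning len(models)) if no probability reaches alpha."""
    decisions = np.asarray(decisions)
    probs = [m.predict_proba((decisions == i).astype(float).reshape(1, -1))[0, 1]
             for i, m in enumerate(models)]
    best = int(np.argmax(probs))
    return best if probs[best] >= alpha else len(models)

# Toy usage: 3 classifiers, 2 classes, a handful of training decisions.
train_decisions = [[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 0]]
train_classes   = [0, 0, 1, 1, 1, 1]
models = fit_lr_ensemble(train_decisions, train_classes, n_classes=2)
print(lr_classify(models, [0, 0, 1]))
```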

3 Analysis of Ensemble Methods

In the last section we described several methods of creating ensembles of classifiers. These methods have all been used in applications, but the empirical evidence of their effectiveness is mixed. In many studies they have improved the accuracy obtained by any single classifier [3, 10, 11, 28, 31]. At the same time, in other analytical studies no improvement over the single classifiers was seen [9, 16]. Characteristics of the data to be combined determine the degree of improvement obtained by the combination. Since majority voting, unlike all our other combination methods, is not based on the performance of the members of the ensemble, we analyze the performance of majority voting separately.

3.1 Majority Voting

When multiple classifiers are combined using majority vote, we expect to obtain good results based on the belief that the majority of experts are more likely to be correct in their decision when they agree in their opinion. So if the decisions of $D$ classifiers are combined, and more than half of them decide that observation $x$ belongs to class $C_i$, the ensemble decides that $x \in C_i$.

Even though common sense leads us to believe that combining experts' decisions in this way will improve the correctness of the integrated result, this is not always true. Hansen and Salamon [9] prove that if all of the classifiers being combined have an error rate of less than 50%, we may expect the accuracy of the ensemble to improve when we add more experts to the combination. However, their proof is restricted to the situation in which all of the classifiers perform independently and have the same error rate. This conclusion depends on the fact that if all $D$ single classifiers have a certain likelihood $(1-p)$ of arriving at the correct classification, then the likelihood of the majority rule being in error is

$$\sum_{k > D/2}^{D} \binom{D}{k} p^k (1-p)^{D-k}$$

However, it is not common to obtain the same error rate for several different classifiers.
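A short sketch of this estimate, assuming independent classifiers with a common error rate p and an odd number of voters so that ties cannot occur (the function name is ours):

```python
from math import comb

def majority_error(p, D):
    """Probability that a majority of D independent classifiers, each with
    error rate p, is wrong (D assumed odd so there are no ties)."""
    return sum(comb(D, k) * p**k * (1 - p)**(D - k)
               for k in range(D // 2 + 1, D + 1))

# With individual error rate 0.3, the majority of 3 classifiers errs less often,
# and adding more classifiers keeps driving the error down.
print(majority_error(0.3, 3))   # ~0.216
print(majority_error(0.3, 5))   # ~0.163
```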

Hansen and Salamon explain that the consensus error rate may be less than the individual error rate. This will occur when the minimum error rate $p_1$ obtained by any of the classifiers in an ensemble of three classifiers meets the following criterion:

$$p_1 > p_1 p_2 p_3 + (1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) \qquad (4)$$

where $p_1$, $p_2$ and $p_3$ are the error rates of the three classifiers. Even though this condition may explain when the combination does not lower the error rate obtained by the best classifier, it does not predict the accuracy that can be achieved by the ensemble. In fact, it will generally estimate the accuracy of the combination to be higher than the one actually obtained, since it expects the classifiers to perform independently when, in fact, they do not.

One of the factors affecting the estimate obtained in Equation (4) is the fact that the classifiers make errors in a correlated manner. Ali and Pazzani [2] show that the degree of error reduction obtained by an ensemble of classifiers that uses majority voting is related to the degree to which the members of the ensemble make errors in an uncorrelated manner. The error correlation, $\phi$, is defined as the probability that two different classifiers $e_i$, $e_k$ make the same kind of error, and it can be measured by

$$\phi = \frac{1}{D(D-1)} \sum_{i=1}^{D} \sum_{k=i+1}^{D} P(e_i(x) = e_k(x) = j,\ x \notin C_j) \qquad (5)$$

where the value of $\phi$ is higher when the members of the ensemble make more correlated errors. A limitation of $\phi$ is that it is a pairwise measure and is restricted to ensembles that make errors when two of their members make errors; i.e., considering majority voting, $\phi$ is restricted to ensembles of three experts.

More specific is the work of Matan [15], who establishes lower and upper bounds for an ensemble of classifiers, showing that majority voting might, under certain conditions, perform worse than any of the members of the ensemble. For example, consider Table 8. The classifiers have recognition rates of 60%, 60% and 50% respectively. However, the majority voting ensemble obtains only 40% accuracy. Nevertheless, if the decisions of the three classifiers, with the same accuracies, are distributed as in Table 9, the ensemble has a recognition rate of 80%.
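The error correlation of Equation (5) can be estimated from held-out decisions roughly as follows (our sketch; the function name and the toy data are ours):

```python
import numpy as np

def error_correlation(decisions, true_classes):
    """Estimate phi of Equation (5): the fraction of samples on which a pair of
    classifiers agrees on the same wrong label, accumulated over all pairs and
    scaled by the 1/(D(D-1)) factor used in the text.
    decisions has shape (n_samples, D)."""
    decisions = np.asarray(decisions)
    true_classes = np.asarray(true_classes)
    n_samples, D = decisions.shape
    total = 0.0
    for i in range(D):
        for k in range(i + 1, D):
            same_wrong = np.mean((decisions[:, i] == decisions[:, k]) &
                                 (decisions[:, i] != true_classes))
            total += same_wrong
    return total / (D * (D - 1))

# Three classifiers, four samples; the first two classifiers share one error.
decisions = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 1]])
true_classes = np.array([0, 0, 0, 1])
print(error_correlation(decisions, true_classes))
```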

Table 8: Majority vote example where the ensemble performs worse than any of its members: the number of samples of each true class ($C_1$, $C_2$) for every combination of the three experts' decisions, together with the majority decision.

Table 9: Majority vote example where the ensemble performs better than any of its members: the same three recognition rates as in Table 8, but with the decisions distributed differently.

So, in [15], the accuracy of a majority voting ensemble is bounded by the following theorem. Given an ensemble $E$ of $D$ classifiers $\{e_1, e_2, \ldots, e_D\}$ with respective performances $g_1 \le g_2 \le \ldots \le g_D$, the performance of a $k$-of-$D$ majority voting classifier under the same distribution is bounded by

$$\begin{aligned}
\max(E) &= \min\left(1,\ \Sigma(k),\ \Sigma(k-1),\ \ldots,\ \Sigma(2),\ \Sigma(1)\right) \\
\min(E) &= \max\left(0,\ \xi(k),\ \xi(k-1),\ \ldots,\ \xi(2),\ \xi(1)\right)
\end{aligned} \qquad (6)$$

where

$$\Sigma(m) = \frac{1}{m} \sum_{i=1}^{D-k+m} g_i \qquad m = 1,\ldots,k$$

$$\xi(m) = \frac{1}{m} \sum_{i=k-m+1}^{D} g_i - \frac{D-k}{m} \qquad m = 1,\ldots,k$$
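A small sketch of these bounds (ours), assuming the individual accuracies are given in ascending order as in the theorem:

```python
def matan_bounds(accuracies, k):
    """Upper and lower bounds of Equation (6) for a k-of-D voting ensemble.

    accuracies: individual recognition rates g_1 <= g_2 <= ... <= g_D.
    Returns (min_E, max_E)."""
    g = sorted(accuracies)
    D = len(g)
    sigma = [sum(g[:D - k + m]) / m for m in range(1, k + 1)]
    xi = [(sum(g[k - m:]) - (D - k)) / m for m in range(1, k + 1)]
    return max(0.0, *xi), min(1.0, *sigma)

# The recognition rates of Tables 8 and 9 (50%, 60%, 60%) under 2-of-3 voting.
print(matan_bounds([0.5, 0.6, 0.6], k=2))
# -> (0.35, 0.85): the 40% ensemble of Table 8 and the 80% ensemble of Table 9
#    both fall inside this range.
```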

3.2 Ensembles Based on the Performance of their Members

The majority voting combination does not consider that some classifiers of the ensemble may be more accurate than others; it treats the opinion of each expert as one vote, regardless of whether that classifier is almost always wrong or right. In an attempt to improve on majority voting, we explored other ways of combining different classifiers that consider individual classifier performance. These methods, based on Bayes theory, behavior-knowledge space, the Dempster-Shafer calculus and logistic regression, have different characteristics and rely on different assumptions, so each may combine some kinds of data better than others. We will also show how to estimate an upper bound for the performance of any ensemble, regardless of the combination method used.

3.2.1 Independence

Combinations based on Bayes theorem and Dempster-Shafer theory assume independence in the decisions of the members of the ensemble. In [31] it was proposed that this independence can be achieved if the experts are trained over different data sets. However, considering that all the experts are modeling the same function that defines the real class of each observation, it is unlikely that they behave independently as required by these ensemble methods. Does this mean that these combination methods cannot be used?

To answer this question we must consider why independence is required. For an ensemble based on Bayes theory, the assumption of independence in the decisions of the classifiers allows the calculation of the value of $P(e_1(x) = j_1, \ldots, e_D(x) = j_D \mid x \in C_i)$, $i = 1,\ldots,M$, from the probabilities $P(e_k(x) = j_k \mid x \in C_i)$, $k = 1,\ldots,D$, by

$$P(e_1(x) = j_1, \ldots, e_D(x) = j_D \mid x \in C_i) = \prod_{k=1}^{D} P(e_k(x) = j_k \mid x \in C_i).$$

These probabilities are then used to obtain the class of an observation $x$ by choosing the class with maximum belief given the events $e_k(x) = j_k$, $k = 1,\ldots,D$. The belief in $x \in C_i$, $Bel(C_i)$, $i = 1,\ldots,M$, is obtained by

$$Bel(C_i) = \frac{\prod_{k=1}^{D} P(x \in C_i \mid e_k(x) = j_k)}{\sum_{l=1}^{M} \prod_{k=1}^{D} P(x \in C_l \mid e_k(x) = j_k)} \qquad (7)$$

and this is used to estimate

$$Bel(C_i) = P(x \in C_i \mid e_1(x) = j_1, e_2(x) = j_2, \ldots, e_D(x) = j_D). \qquad (8)$$

Since the result of the ensemble depends only on the class with the highest belief, we obtain the same results from the ensemble using Equation (7) or Equation (8) if, for all the combinations of events $e_k(x) = j_k$, $k = 1,\ldots,D$ and $j_k = 1,\ldots,M+1$,

$$\arg\max_{i=1,\ldots,M} \left( \frac{\prod_{k=1}^{D} P(x \in C_i \mid e_k(x) = j_k)}{\sum_{l=1}^{M} \prod_{k=1}^{D} P(x \in C_l \mid e_k(x) = j_k)} \right) = \arg\max_{i=1,\ldots,M} P(x \in C_i \mid e_1(x) = j_1, \ldots, e_D(x) = j_D) \qquad (9)$$

So, even though Equation (9) is guaranteed to hold only when the decisions of the members of the ensemble are independent, the fact that they are not does not prevent the ensemble from performing equally well in both cases.

A similar argument may be used for ensembles based on Dempster-Shafer theory.

Ensembles using behavior-knowledge space (BKS) do not assume that the decisions of the classifiers are independent; when the decisions are independent, this method and the one based on Bayes theory should obtain the same performance, since they are based on the same principle, as was shown in Section 2.3. When the BKS method is used with $D$ experts that decide between $M$ classes, there exist $(M+1)^D$ possible combinations of decisions to be taken into account. We also require enough samples from each possible combination of decisions to obtain a representative number of observations. For example, consider an ensemble of five experts and three classes using 10 samples per combination. This will require at least $10 \cdot (3+1)^5 = 10{,}240$ observations to train it. So if the BKS ensemble method is used to combine the decisions of more than a few experts, or to decide between more than two or three classes, it requires large training sets that may not be available in many cases.

3.2.2 Maximum Accuracy of an Ensemble

The sections above explain when each kind of combination scheme is likely to perform better than others. We have seen that when there is not enough data available for training it is better to use ensembles based on Bayesian theory than BKS, and that the combination using Dempster-Shafer calculus is better suited than other combination techniques to manipulate uncertain decisions (class $M+1$) of the individual classifiers.

The above analysis could lead us to believe that even when some ensemble methods might not obtain better accuracies than the single classifiers, careful selection of the combination method, or creation of other combination methods, will always allow us to find a way to improve the performance obtained by any single expert. There is, however, a limit on the accuracy ensembles can achieve regardless of the combination method used.

Consider, for example, three experts $e_1$, $e_2$ and $e_3$ that assign one of two classes, $C_1$ and $C_2$, to a sample. As an example, the number of observations of each class corresponding to all possible combinations of decisions of the three classifiers is shown in Table 10. Consider the number of observations of each class when $e_1(x) = 2$, $e_2(x) = 1$ and $e_3(x) = 1$. The table shows that there are 13 samples of class 1 and 25 samples of class 2 that have been classified with this combination of decisions.

Table 10: Number of samples of each true class ($C_1$, $C_2$) for every combination of decisions of the three experts $e_1$, $e_2$, $e_3$.

Since the only data available to the ensemble is the class assigned to the observation by each one of the classifiers, the ensemble will always classify at least 13 of these observations incorrectly, regardless of the ensemble method used. Notice that this situation also occurs with other combinations of results. For example, when $e_1(x) = 1$, $e_2(x) = 2$ and $e_3(x) = 1$, it is impossible to decide whether the sample $x$ belongs to class 1 or 2, since the probability of $x$ belonging to either class is the same.

From these examples we can observe that an ensemble of classifiers can obtain better accuracies as long as most of the samples for each combination of decisions of the experts belong to the same class. So, if we can measure the concentration of samples $x$ that have been classified as $e_1(x) = j_1, \ldots, e_D(x) = j_D$ and belong to a specific class, we can obtain a measure of the expected accuracy of the ensemble.

Let us define $n_{j_1 \ldots j_D i}$ as the number of samples of class $C_i$ that have been classified as $e_k(x) = j_k$ by each expert $k = 1,\ldots,D$, and $n$ as the total number of samples. The relative frequency of the number of samples $x$ such that $e_1(x) = j_1, \ldots, e_D(x) = j_D$ and $x \in C_i$ is

$$\pi_{j_1 \ldots j_D i} = \frac{n_{j_1 \ldots j_D i}}{n}$$

and

$$\pi_{+\ldots+i} = \sum_{j_1=1}^{M+1} \cdots \sum_{j_D=1}^{M+1} \frac{n_{j_1 \ldots j_D i}}{n}, \qquad
\pi_{j_1 \ldots j_D +} = \frac{\sum_{i=1}^{M} n_{j_1 \ldots j_D i}}{n}$$

Concentration can then be measured using Goodman and Kruskal's tau, $\tau$, which describes the proportional reduction in the probability of an incorrect guess of the class of an observation given the decisions of the experts [1]. The measure $\tau$ is defined as

$$\tau = \frac{\sum_{j_1=1}^{M+1} \cdots \sum_{j_D=1}^{M+1} \dfrac{\sum_{i=1}^{M} (\pi_{j_1 \ldots j_D i})^2}{\pi_{j_1 \ldots j_D +}} \;-\; \sum_{i=1}^{M} (\pi_{+\ldots+i})^2}{1 - \sum_{i=1}^{M} (\pi_{+\ldots+i})^2}$$

A large value of $\tau$ represents a strong association between the decisions of the experts and the actual class of the observation. This measure, like any other measure of concentration or uncertainty, indicates when the combined decisions of the single classifiers carry enough information to decide the class of a sample. A difficulty with this measure is determining how strong a "strong" association is. This difficulty becomes more pronounced as the number of classes grows, since in that case the value of $\tau$ will tend to 0 even when there is a strong association. The concentration measure can show which of two data sets having the same number of classes is better suited to produce an accurate ensemble of experts, but it cannot estimate how accurate this ensemble can be.

Let $\theta$ be the set of all possible classes $\{1,\ldots,M\}$ to which an observation can belong. To obtain the maximum accuracy achievable by any ensemble of $D$ experts, we define the function $G : (\theta \cup \{M+1\})^D \rightarrow \theta$. Given the decisions of $D$ classifiers regarding the class of an observation $x$, $e_1(x) = j_1, \ldots, e_D(x) = j_D$, $G$ returns the class with the most observations for that combination of decisions. In the example in Table 10, $G(2,1,1) = 1$. Given $G$, we can then define the maximum accuracy that can be obtained by an ensemble as

$$\frac{\sum_{j_1=1}^{M+1} \cdots \sum_{j_D=1}^{M+1} n_{j_1 \ldots j_D\, G(j_1 \ldots j_D)}}{N}$$

where $n_{j_1 \ldots j_D\, i}$ represents the number of samples of class $C_i$ that have been classified as $e_k(x) = j_k$ by each expert $k = 1,\ldots,D$, and $N$ is the total number of samples considered. In the example in Table 10 we find that, even using the best combination method possible, the best recognition rate we may obtain is 80%.
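Both the concentration measure and the maximum-accuracy bound can be computed directly from the cell counts $n_{j_1 \ldots j_D i}$. The sketch below is our illustration with toy counts, not the data of Table 10, and the function names are ours.

```python
import numpy as np

def max_ensemble_accuracy(counts):
    """Upper bound on ensemble accuracy: for every combination of classifier
    decisions, no ensemble can do better than always predicting the majority
    class of that cell.  counts maps a decision tuple to per-class counts."""
    total = sum(c.sum() for c in counts.values())
    recoverable = sum(c.max() for c in counts.values())
    return recoverable / total

def goodman_kruskal_tau(counts):
    """Goodman and Kruskal's tau: proportional reduction in the probability of
    guessing the class incorrectly once the classifiers' decisions are known."""
    total = sum(c.sum() for c in counts.values())
    class_marginal = sum(counts.values()) / total            # pi_{+...+i}
    baseline = 1.0 - np.sum(class_marginal ** 2)
    conditional = sum(np.sum((c / total) ** 2) / (c.sum() / total)
                      for c in counts.values())
    return (conditional - np.sum(class_marginal ** 2)) / baseline

# Cell counts n_{j1 j2 j3, i}: decision tuple -> samples of class 1 and class 2.
counts = {(1, 1, 1): np.array([40, 2]),
          (2, 1, 1): np.array([13, 25]),
          (2, 2, 2): np.array([3, 17])}
print(max_ensemble_accuracy(counts))  # fraction of samples any ensemble can get right
print(goodman_kruskal_tau(counts))
```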

3.2.3 Reliability

Usually only two measures are used to describe the performance of a classifier: its accuracy or recognition rate, which measures the probability that the output label of an expert for a sample $x$ represents the actual class of $x$, and its substitution rate, which measures the probability of the expert making a mistake.

However, a classifier may also reject a sample, i.e., the expert $e_k$ might not be able to classify $x$ into any class because there is not enough evidence of $x$ belonging to a particular class. Rejected observations cannot be considered in the recognition or substitution rates. When some of the samples may be rejected, the recognition rate fails to provide a good description of the performance of the classifier because it does not take the rejection rate into consideration. In this case the reliability gives a better description of the accuracy of a classifier than the recognition rate. Reliability is defined as

$$\mathrm{Reliability} = \frac{\mathrm{Recognition}}{1 - \mathrm{Rejection}}$$

and represents the proportion of samples not rejected by the expert $e_k$ that have been classified correctly.

Consider the example in Table 10. Earlier we mentioned that when $e_1(x) = 1$, $e_2(x) = 2$ and $e_3(x) = 1$, the decision of whether $x \in C_1$ or $x \in C_2$ was not an easy one. In fact, the probability of making an error is the same for either class chosen; for this combination of evidence at least half of the samples are classified incorrectly by any ensemble. In this case, the ensemble can benefit from rejecting all the observations in this situation, leading to an increase in the ensemble's reliability over the reliability of the single classifiers. An increase in reliability means that whenever the ensemble decides to classify a sample $x$ as belonging to class $C_i$ instead of rejecting it, the probability of $x \in C_i$ is higher than when no samples are rejected.

4 Experiments

We carried out several experiments to test the predictive value of the estimates of ensemble performance proposed in Section 3, as well as the expected behavior of the ensemble methods with respect to their reliability and the independence of their decisions. Four data sets were taken from the UCI repository [18]. These data sets were used to train three different classifiers, whose results were then combined to obtain the integrated classifier.

4.1 Data

The data sets were chosen on the basis of their number of observations, number of classes, noise, and missing features. In general we chose data sets that had more than one hundred observations per class. We wanted the groups chosen to be a mixture of noisy/not noisy, with/without missing features. Based on these parameters we chose Tic-Tac-Toe Endgame (TTT), Wisconsin Breast Cancer (WBC), Pima Indians Diabetes (PIMA), and Credit Screening (CRX). All of these data sets have only two classes.

Each data set was partitioned into four subsets, $G_1$, $G_2$, $G_3$ and $G_4$, such that their union forms the complete data set and their pairwise intersections are empty. The single classifiers were trained using one of the groups, and the accuracy of the resulting classifier was measured using the examples in the other three groups. The performance of majority voting was tested using the three partitions not used to train the classifiers. The behavior of the ensembles requiring knowledge of the performance of their members was evaluated using one of the remaining groups to train the ensemble and the other two to test it. Training the ensemble methods involved obtaining:

- Bayes: the confusion matrices for each of the members of the ensemble;

- BKS: the Behavior-Knowledge Space;

- Dempster-Shafer: the recognition, substitution, and rejection rates for each individual classifier;

- Logistic Regression: the constant parameters of the logit function for each class.

This procedure was repeated for all permutations of the four groups in each data set to train the single classifiers and the combination method, and to test the ensembles. Thus, except for majority voting, the ensembles were tested in twelve different ways for each data set, as shown in Table 11.

       Train                  Test
 #     Single   Ensemble      Ensemble
 1     G1       G2            G3 G4
 2     G1       G3            G2 G4
 3     G1       G4            G2 G3
 4     G2       G1            G3 G4
 5     G2       G3            G1 G4
 6     G2       G4            G1 G3
 7     G3       G1            G2 G4
 8     G3       G2            G1 G4
 9     G3       G4            G1 G2
10     G4       G1            G2 G3
11     G4       G2            G1 G3
12     G4       G3            G1 G2

Table 11: Partition of the data sets for the 12 experiments.

4.1.1 Classifiers

The ensemble of classifiers requires, as data, the decisions of several experts with respect to a given observation. For our experiments we used three different classifiers, each of them based on a particular machine learning technique:

- Expert $e_1$: a Bayesian belief network, developed using the Netica API library [19]. No modification or selection of features was done except discretization of continuous data. This kind of classifier returns the probabilities that a sample belongs to each class. To obtain the result of the classifier we take the class with the greatest probability.

- Expert $e_2$: a neural network trained with backpropagation. Training and testing of the neural networks was done using the Aspirin/Migraines Neural Network Software [14]. This classifier returns a number between 0 and 1. We consider a sample as belonging to the class whose assigned number (between zero and one) is closest to the output produced for that sample. Since all the data sets used have only two classes, the numbers assigned to the classes are 0 and 1.

- Expert $e_3$: a decision tree, obtained by training and testing using C4.5 [22]. This kind of classifier returns a label representing the predicted class of the example.

4.2 Voting

Majority voting is, perhaps, one of the best studied of the ensemble methods described. In Section 3.1 we described some measures that may help to predict the performance of an ensemble that uses majority voting. One of the conditions for the improvement of accuracy through majority voting is the one proposed by Hansen and Salamon [9]. They established that, given $(1-p_i)$ the probability of expert $e_i$ predicting the class of an example correctly, and assuming expert $e_1$ is the member of the ensemble with the lowest error rate, the ensemble can have a lower error rate than its members if

$$p_1 > p_1 p_2 p_3 + (1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) = \hat{p}_E \qquad (10)$$

However, as we can see in Table 12, even though $\hat{p}_E$ is always less than the error rate of every single classifier, the ensemble does not perform better than the best of its members, except for the credit screening data set. This is due to the correlation of errors between classifiers, as we will show next.

Consider the error ratio, the relation between the error of the ensemble and the error rate $p_1$ of the best of the single classifiers:

$$\mathrm{ErrorRatio} = \frac{\mathrm{ErrorEnsemble}}{p_1}$$

                       Error Rate                               Error
        p_1      p_2      p_3      Voting    p̂_E      φ        Ratio
 CRX    16.76%   16.96%   19.08%   16.04%    8.21%    -        -
 PIMA   25.65%   34.90%   29.86%   27.21%    21.69%   -        -
 TTT    30.79%   31.42%   24.33%   28.19%    20.11%   -        -
 WBC    36.51%   27.43%   34.34%   32.47%    25.39%   -        -

Table 12: Error rate estimation for voting ensembles and error correlation.

In Table 12 we can see how the error ratio increases when the members of the ensemble make errors in a more correlated manner (higher φ): when the classifiers make the same error on an example, the ensemble will make more errors than if the members of the ensemble made different errors. Given this explanation we might expect the error correlation also to be related to the difference between the real accuracy of the ensemble and $\hat{p}_E$. However, Table 12 shows that this is not true. Consider, for example, the case where all three classifiers agree on an error, against the case where only two classifiers make the same error. Even though in both cases the error should be counted only once, the measure of error correlation (Equation 5) counts the ensemble error produced when the three experts agree as two errors, while in the other case it counts only one. Thus, depending on the distribution of the ensemble errors made by two or three classifiers, the difference between the error rate of the ensemble and the one estimated by $\hat{p}_E$ cannot be explained in relation to φ alone.

Since the actual ensemble will have an error rate higher than the one predicted by $\hat{p}_E$, this measure can be regarded as an upper bound on the accuracy of a majority voting ensemble. However, this estimate of the error of the ensemble is only defined for three classifiers. A more general voting method is $k$-of-$D$ voting, a voting ensemble $E$ where $E(x) = j$ if at least $k$ of the $D$ classifiers of the ensemble select $j$ as the class of $x$. In [15], Matan proposed Equation (6) as a way to obtain upper and lower bounds for the accuracy of a $k$-of-$D$ majority voting ensemble.

Considering that in our experiments we are using three classifiers and 2-of-3 majority ensembles, the upper and lower bounds of the ensembles are obtained by

33 max(e) = min (1, 1 ) 2 (g 1 + g 2 + g 3 ),g 1 +g 2 ( min(e) = max 0, 1 ) 2 (g 1 + g 2 + g 3 1),g 2 +g 3 1 (11) given that g i,i=1,...,d is the accuracy of expert e i. Rec. rate E min(e) max(e) E B(E) E Voting min max min max min max CRX 83.96% 1.32% 6.54% 13.35% 18.18% 0.58% 2.51% PIMA 72.79% 19.79% 21.53% 24.65% 30.90% 0.00% 4.51% TTT 71.81% 15.83% 19.82% 26.11% 30.36% 0.28% 7.93% WBC 67.53% 11.04% 28.30% 24.32% 32.55% 1.35% 18.79% Table 13: Upper and Lower bounds for the accuracy of the voting ensemble Since four experiments were done for each data set, in Table 13 we show the minimum (min) and maximum (max) differences between the actual accuracy of the ensemble E and the bounds obtained by these equations in all the experiments over the same data. As we can see in this table, there is always a big difference between max(e) and the accuracy of the ensemble. In Section we defined a new estimator for a tighter upper bound in the accuracy that can be obtained by any of the considered ensemble methods. This estimator, for three classifiers and two classes is defined as: 3j1 3j2 3j3 =1 =1 =1 B(E) = n j 1 j 2 j 3 G(j 1 j 2 j 3 ) (12) N Figure 1 shows that this new estimator produces upper bounds closer to the real accuracy of the ensemble. Observe that for almost all the experiments the upper bound proposed by Matan obtains a maximum accuracy of 1, opposed to B(E) that represented a tighter bound that could predict the actual accuracy of the ensemble in some cases. Examples of this are the results obtained for the Pima Indians diabetes data sets, where both, the accuracy of the ensemble and B(E) are the same in one of the experiments, and 33


More information

From inductive inference to machine learning

From inductive inference to machine learning From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen

More information

MATH2206 Prob Stat/20.Jan Weekly Review 1-2

MATH2206 Prob Stat/20.Jan Weekly Review 1-2 MATH2206 Prob Stat/20.Jan.2017 Weekly Review 1-2 This week I explained the idea behind the formula of the well-known statistic standard deviation so that it is clear now why it is a measure of dispersion

More information

Bayesian Concept Learning

Bayesian Concept Learning Learning from positive and negative examples Bayesian Concept Learning Chen Yu Indiana University With both positive and negative examples, it is easy to define a boundary to separate these two. Just with

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

A new Approach to Drawing Conclusions from Data A Rough Set Perspective

A new Approach to Drawing Conclusions from Data A Rough Set Perspective Motto: Let the data speak for themselves R.A. Fisher A new Approach to Drawing Conclusions from Data A Rough et Perspective Zdzisław Pawlak Institute for Theoretical and Applied Informatics Polish Academy

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Bayesian Reasoning. Adapted from slides by Tim Finin and Marie desjardins.

Bayesian Reasoning. Adapted from slides by Tim Finin and Marie desjardins. Bayesian Reasoning Adapted from slides by Tim Finin and Marie desjardins. 1 Outline Probability theory Bayesian inference From the joint distribution Using independence/factoring From sources of evidence

More information

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr

Study on Classification Methods Based on Three Different Learning Criteria. Jae Kyu Suhr Study on Classification Methods Based on Three Different Learning Criteria Jae Kyu Suhr Contents Introduction Three learning criteria LSE, TER, AUC Methods based on three learning criteria LSE:, ELM TER:

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Topic 3: Hypothesis Testing

Topic 3: Hypothesis Testing CS 8850: Advanced Machine Learning Fall 07 Topic 3: Hypothesis Testing Instructor: Daniel L. Pimentel-Alarcón c Copyright 07 3. Introduction One of the simplest inference problems is that of deciding between

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

A novel k-nn approach for data with uncertain attribute values

A novel k-nn approach for data with uncertain attribute values A novel -NN approach for data with uncertain attribute values Asma Trabelsi 1,2, Zied Elouedi 1, and Eric Lefevre 2 1 Université de Tunis, Institut Supérieur de Gestion de Tunis, LARODEC, Tunisia trabelsyasma@gmail.com,zied.elouedi@gmx.fr

More information

Decision Trees: Overfitting

Decision Trees: Overfitting Decision Trees: Overfitting Emily Fox University of Washington January 30, 2017 Decision tree recap Loan status: Root 22 18 poor 4 14 Credit? Income? excellent 9 0 3 years 0 4 Fair 9 4 Term? 5 years 9

More information

Single Maths B: Introduction to Probability

Single Maths B: Introduction to Probability Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Numerical Learning Algorithms

Numerical Learning Algorithms Numerical Learning Algorithms Example SVM for Separable Examples.......................... Example SVM for Nonseparable Examples....................... 4 Example Gaussian Kernel SVM...............................

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability

More information

arxiv: v1 [cs.ai] 4 Sep 2007

arxiv: v1 [cs.ai] 4 Sep 2007 Qualitative Belief Conditioning Rules (QBCR) arxiv:0709.0522v1 [cs.ai] 4 Sep 2007 Florentin Smarandache Department of Mathematics University of New Mexico Gallup, NM 87301, U.S.A. smarand@unm.edu Jean

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

Introduction to Bayesian Inference

Introduction to Bayesian Inference Introduction to Bayesian Inference p. 1/2 Introduction to Bayesian Inference September 15th, 2010 Reading: Hoff Chapter 1-2 Introduction to Bayesian Inference p. 2/2 Probability: Measurement of Uncertainty

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015 S 88 Spring 205 Introduction to rtificial Intelligence Midterm 2 V ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib

More information

Lecture Notes 7 Random Processes. Markov Processes Markov Chains. Random Processes

Lecture Notes 7 Random Processes. Markov Processes Markov Chains. Random Processes Lecture Notes 7 Random Processes Definition IID Processes Bernoulli Process Binomial Counting Process Interarrival Time Process Markov Processes Markov Chains Classification of States Steady State Probabilities

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

STA Module 4 Probability Concepts. Rev.F08 1

STA Module 4 Probability Concepts. Rev.F08 1 STA 2023 Module 4 Probability Concepts Rev.F08 1 Learning Objectives Upon completing this module, you should be able to: 1. Compute probabilities for experiments having equally likely outcomes. 2. Interpret

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

MATH 556: PROBABILITY PRIMER

MATH 556: PROBABILITY PRIMER MATH 6: PROBABILITY PRIMER 1 DEFINITIONS, TERMINOLOGY, NOTATION 1.1 EVENTS AND THE SAMPLE SPACE Definition 1.1 An experiment is a one-off or repeatable process or procedure for which (a there is a well-defined

More information

Outline: Ensemble Learning. Ensemble Learning. The Wisdom of Crowds. The Wisdom of Crowds - Really? Crowd wiser than any individual

Outline: Ensemble Learning. Ensemble Learning. The Wisdom of Crowds. The Wisdom of Crowds - Really? Crowd wiser than any individual Outline: Ensemble Learning We will describe and investigate algorithms to Ensemble Learning Lecture 10, DD2431 Machine Learning A. Maki, J. Sullivan October 2014 train weak classifiers/regressors and how

More information

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses

Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Relationship between Least Squares Approximation and Maximum Likelihood Hypotheses Steven Bergner, Chris Demwell Lecture notes for Cmpt 882 Machine Learning February 19, 2004 Abstract In these notes, a

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 3

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 3 CS434a/541a: attern Recognition rof. Olga Veksler Lecture 3 1 Announcements Link to error data in the book Reading assignment Assignment 1 handed out, due Oct. 4 lease send me an email with your name and

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Nearest Neighbor Pattern Classification

Nearest Neighbor Pattern Classification Nearest Neighbor Pattern Classification T. M. Cover and P. E. Hart May 15, 2018 1 The Intro The nearest neighbor algorithm/rule (NN) is the simplest nonparametric decisions procedure, that assigns to unclassified

More information

Selection of Classifiers based on Multiple Classifier Behaviour

Selection of Classifiers based on Multiple Classifier Behaviour Selection of Classifiers based on Multiple Classifier Behaviour Giorgio Giacinto, Fabio Roli, and Giorgio Fumera Dept. of Electrical and Electronic Eng. - University of Cagliari Piazza d Armi, 09123 Cagliari,

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Chapter 1 (Basic Probability)

Chapter 1 (Basic Probability) Chapter 1 (Basic Probability) What is probability? Consider the following experiments: 1. Count the number of arrival requests to a web server in a day. 2. Determine the execution time of a program. 3.

More information

Multicomponent DS Fusion Approach for Waveform EKG Detection

Multicomponent DS Fusion Approach for Waveform EKG Detection Multicomponent DS Fusion Approach for Waveform EKG Detection Nicholas Napoli University of Virginia njn5fg@virginia.edu August 10, 2013 Nicholas Napoli (UVa) Multicomponent EKG Fusion August 10, 2013 1

More information

Chapter 6 Classification and Prediction (2)

Chapter 6 Classification and Prediction (2) Chapter 6 Classification and Prediction (2) Outline Classification and Prediction Decision Tree Naïve Bayes Classifier Support Vector Machines (SVM) K-nearest Neighbors Accuracy and Error Measures Feature

More information

MATH MW Elementary Probability Course Notes Part I: Models and Counting

MATH MW Elementary Probability Course Notes Part I: Models and Counting MATH 2030 3.00MW Elementary Probability Course Notes Part I: Models and Counting Tom Salisbury salt@yorku.ca York University Winter 2010 Introduction [Jan 5] Probability: the mathematics used for Statistics

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information