Finite Mixture Model of Bounded Semi-Naive Bayesian Networks for Classification


Kaizhu Huang, Irwin King and Michael R. Lyu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, New Territories, Hong Kong

Abstract

The Naive Bayesian (NB) network classifier, a probabilistic model with a strong assumption of conditional independence among features, shows surprisingly competitive prediction performance even when compared with some state-of-the-art classifiers. With a looser assumption of conditional independence, the Semi-Naive Bayesian (SNB) network classifier is superior to the NB classifier when features are combined. However, the structure of an SNB is still strongly constrained, which may produce inaccurate distributions for some datasets. A natural way to improve the SNB is to extend it with a mixture approach. However, traditional SNBs learn their final structure from data with local heuristic methods, whereas the mixture approach obtains the structure iteratively with the Expectation-Maximization (EM) method; it is difficult to integrate a local heuristic into the maximization step, since the resulting procedure may not converge. In this paper we first develop a Bounded Semi-Naive Bayesian network (B-SNB) model, which restricts the number of variables that can be joined into a combined feature. As opposed to the local property of traditional SNB models, our model enjoys a global nature and maintains a polynomial time cost. Overcoming the difficulty of integrating SNBs into the mixture model, we then propose an algorithm to extend it into a finite mixture structure, named the Mixture of Bounded Semi-Naive Bayesian networks (MBSNB). We give theoretical derivations, an outline of the algorithm, an analysis of the algorithm and a set of experiments to demonstrate the usefulness of MBSNB in some classification tasks. The novel finite MBSNB network shows good speedup, the ability to converge and an increase in prediction accuracy.

I. INTRODUCTION

Learning accurate classifiers is one of the basic problems in machine learning and data analysis. In such a problem, a pre-classified dataset D = {{x^1, C^1}, {x^2, C^2}, ..., {x^N, C^N}} is given, where x^i = {A_1^i, A_2^i, ..., A_n^i} ∈ R^n is the n-dimensional training sample in real space and C^i ∈ Ω is the class label assigned to x^i in a category space. The objective is to find a mapping function F: R^n → Ω satisfying F(x^i) = C^i. Many methods have been proposed to handle this problem, among them statistical neural networks [23], Support Vector Machines [2] [36] and decision trees [29]. The Naive Bayesian network (NB) [8] [20] shows good performance on this problem even when compared with state-of-the-art classifiers such as C4.5. Under the assumption that the attributes are independent given the class label, i.e., P(A_i, A_j | C) = P(A_i | C) P(A_j | C) for 1 ≤ i ≠ j ≤ n, NB classifies a sample into the class with the largest joint probability P(C, A_1, A_2, ..., A_n). Figure 1 shows the graphical structure of NB. This joint probability can be decomposed into a product form based on the independence assumption, and the mapping function is written as follows:

c = arg max_{C_i} P(C_i, A_1, A_2, ..., A_n) = arg max_{C_i} P(C_i) ∏_{j=1}^{n} P(A_j | C_i).   (1)

The success of NB is somewhat unexpected, since its independence assumption is violated in many cases. A typical example is the so-called Xor problem. In this problem the attributes are two binary variables A and B; when A is the same as B, the class label C is set to 1, otherwise C is set to 0. Thus attribute A is not independent of B given the class variable C, and NB encounters problems in classifying the Xor data. The reason is that P(C = 0), P(C = 1), P(A = 0 | C = 0), P(A = 1 | C = 0), P(A = 0 | C = 1) and P(A = 1 | C = 1) will all be nearly 0.5 when the data samples are sufficient. It will then be hard to assign any sample to class 0 or 1, since the estimated joint probabilities according to Equation (1) for both classes will be about 0.5 × 0.5 × 0.5 = 0.125.
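As a concrete illustration of the decision rule in Equation (1), the following is a minimal sketch (not the authors' implementation; all names are illustrative) of an NB classifier for discrete attributes, evaluated in the log domain to avoid underflow. Run on Xor-style data, every conditional probability is close to 0.5 and the two class scores essentially tie, which is exactly the failure described above.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Estimate P(C) and P(A_j | C) by simple counting (no smoothing, for clarity)."""
    n = len(samples[0])
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)      # (class, attribute index) -> counts of attribute values
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            value_counts[(c, j)][v] += 1
    priors = {c: cnt / len(labels) for c, cnt in class_counts.items()}
    cond = lambda c, j, v: value_counts[(c, j)][v] / class_counts[c]
    return priors, cond, n

def classify_nb(x, priors, cond, n):
    """Equation (1): argmax_C  log P(C) + sum_j log P(A_j | C)."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(max(cond(c, j, x[j]), 1e-12)) for j in range(n))
    return max(priors, key=score)

# Xor data: C = 1 iff A == B; both class scores come out essentially identical.
data = [(0, 0), (1, 1), (0, 1), (1, 0)] * 25
labels = [1 if a == b else 0 for a, b in data]
priors, cond, n = train_nb(data, labels)
print(classify_nb((0, 0), priors, cond, n))
```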

Fig. 1. Naive Bayesian Network Classifier. A_i, 1 ≤ i ≤ n, is an attribute. Each attribute is independent of the other attributes given the class label C.

Fig. 2. Semi-Naive Bayesian Network Classifier. B_i, 1 ≤ i ≤ m, is a combined attribute. Each combined attribute B_i is independent of the other combined attributes B_j, i ≠ j, given the class label C.

Furthermore, so-called Semi-Naive Bayesian networks have been proposed to remedy violations of NB's assumption by joining attributes into several combined attributes, based on a conditional independence assumption among the combined attributes. Performance improvements have been demonstrated in [17] [25]. Figure 2 gives a graphical illustration of a Semi-Naive Bayesian network; here the conditional independence holds among the combined attributes. However, even though SNB makes the constraint of NB looser, the relaxation is slight: SNB is still strongly restricted by its conditional independence assumption among the combined attributes.

How to relax the SNB's strong constraint effectively and efficiently has become a popular topic. One possible way to solve this problem is to search for independence or dependence relationships among the attributes rather than imposing a strong assumption on them. This is the main idea of the so-called unrestricted Bayesian network (BN) [28]. Unfortunately, empirical results have shown that searching for an unrestricted BN structure does not give better results than NB. This is partly because unrestricted BN structures are prone to overfitting (overfitting is the phenomenon in which a classifier fits the training dataset perfectly while showing low prediction accuracy on new data). Furthermore, searching for an unrestricted BN structure is in general an NP-complete problem [3]. Another possible way is to upgrade the SNB into a mixture structure, where a hidden variable is used to coordinate its component SNB structures. Mixture approaches have achieved great success in expanding the expressive power of their restricted components and bringing better performance; the Gaussian Mixture Model [22] is such an example. To the best of our knowledge, compared with the popularity of searching for unrestricted BNs to relax the constraints of SNB, no work has upgraded the SNB structure into a mixture. This is perhaps partly due to the difficulty of inducing mixture parameters for SNB. The core technique for mixture models is the Expectation-Maximization (EM) method, which requires a global optimization algorithm to obtain the structure or parameters of each component. An optimization algorithm with a global nature guarantees the maximization step in EM and thus guarantees that the objective function increases iteratively; in this way, the iterative EM process converges. However, traditional SNB algorithms [17] [25] often search their structures with heuristic methods, because of the large search space of SNB. While these heuristic approaches decrease the total time cost, they also bring in a local nature, which prevents them from being used in a mixture structure.

In this paper, we first propose a Bounded SNB (B-SNB) model. As opposed to the local heuristics of traditional SNBs, B-SNB is shown to have a global nature and a polynomial time cost when given a bound on the maximum number of attributes that can be joined into a combined attribute (we call such a combined attribute a large attribute in our B-SNB model). We search for the B-SNB structure with a combinatorial optimization method. As far as we know, this is the first purely combinatorial formulation of the learning problem for constructing an SNB structure, and also the first SNB algorithm with a global nature. With this B-SNB model, we provide an algorithm to perform the mixture upgrading, based on the EM method. Our experimental results show that this upgrading really brings an increase in prediction rate for some benchmark datasets.

This paper is organized as follows. In Section II, we give a short review of related work. In Section III, we describe our B-SNB model in detail. In Section IV, we discuss the mixture of B-SNB model and give an induction algorithm. In Section V, a complexity analysis is given. Experimental results showing the advantages of our model are presented in Section VI. A discussion is given in Section VII. Finally, we conclude the paper in Section VIII.

II. RELATED WORK

Since the invention of NB, many BN classifiers have been developed, both restricted and unrestricted. Among the restricted BN classifiers are Semi-Naive Bayesian network classifiers [17] [25], Selective Naive Bayesian network classifiers [21], the Recursive Bayesian classifier [19], Tree Augmented Naive Bayesian network classifiers [11], Limited Bayesian network classifiers [30] and Adjusted probability Naive Bayesian network classifiers [37]; for inducing unrestricted Bayesian network classifiers, K2 [5] is a popular algorithm. Since our focus is on the restricted BNs, in the following we first give a short review of the restricted ones and then shift the focus to mixture issues.

NB's success triggered the popularity of Bayesian network classifiers, and a number of algorithms have been developed to relax the strong assumption of NB. Kononenko [17] proposed a Semi-Naive Bayesian network classifier that attempts to combine values of attributes to overcome the shortcomings of NB. However, this proposal did not show significantly better performance in experiments on two datasets: on one dataset, Kononenko's algorithm has the same performance as NB, and on the other only one percentage point is gained over NB.

Rather than combining values, Pazzani [25] combined attributes heuristically, using either a forward sequential selection and joining strategy or a backward sequential elimination and joining strategy. Because of this time-consuming search process, Pazzani's algorithm is limited to combining only two attributes. Even with this restriction, Pazzani's algorithm shows a large increase in classification accuracy compared with NB. Langley and Sage proposed a Selective Bayesian Classifier to eliminate dependent attributes from NB [21]. Different from Pazzani's approach, they discard one of two dependent attributes, whereas Pazzani joins them; this model's performance is shown to be slightly worse than Pazzani's approach. Langley [19] proposed a Recursive Bayesian classifier to adapt NB to some non-linearly separable problems, but it did not provide a significant benefit on naturally occurring databases [25]. Friedman et al. [11] developed the Tree Augmented Naive Bayesian (TAN) network classifier, which integrates the Chow-Liu tree (CLT) [4] technique with NB. A Chow-Liu tree is a tree structure in which each node (attribute) is assumed to have only one other node (attribute) as its parent; in such a configuration, a globally optimal tree structure can be found. By using the Chow-Liu tree technique, TAN enjoys global optimality under the TAN assumption, i.e., each attribute has only the class label and one other attribute as its parents. Its performance is demonstrated to be good against state-of-the-art classification methods in machine learning.

All of the above models are strongly restricted. For example, Pazzani's method can only combine two attributes, and in TAN each attribute has only one attribute as its parent node. These restrictions make induction in these models easy and convenient, but at the same time they confine the expressive power that is the main advantage of Bayesian networks. Given that unrestricted Bayesian networks, despite their powerful expressive ability, do not bring an increase in prediction accuracy, finding another path to upgrade restricted Bayesian networks becomes important. Meila and Jordan [24] proposed a mixture of trees (MT) model to expand the Chow-Liu tree's expressive power based on the EM algorithm. Their model is empirically shown to outperform other models, such as C4.5 and the mixture of factorial models, on many datasets in prediction accuracy.

Fig. 3. Mixture structure of Bayesian network classifiers. BN_i, 1 ≤ i ≤ m, is a restricted Bayesian network and Z is the choice variable.

Motivated by this work, we ask whether it is possible to expand SNB classifiers into a mixture structure. In Figure 3, Z is a choice variable, which is used to condition the component restricted Bayesian networks. Learning the mixture structure and parameters is usually done with EM algorithms [12] [22]. To maintain the convergence of EM, we need a globally optimal, or at least sub-optimal, algorithm for constructing the component restricted Bayesian networks. This global optimality ensures that the value of the objective function increases in each iteration, which in turn ensures the convergence of the iterative process. In learning Bayesian network classifiers, however, heuristic methods are typically used to search for a good network structure rather than an optimal one (an exception is TAN, whose upgrading can actually be considered as MT), mainly to save computational cost. Thus in this paper we first use a combinatorial optimization technique to develop a sub-optimal algorithm with polynomial time cost for the Bounded Semi-Naive Bayesian network. Bringing combinatorial optimization techniques into learning structure from data was first reported in [14] and [31], which aimed at finding an approximation of the optimal hypergraph structure by combinatorial techniques. Their contribution may lie in the theoretical field rather than in real applications, since their approximation ratio to the optimal solution is about 1/324 even under a very strong constraint. Breaking through the bottleneck for mixture upgrading, we then propose a mixture model of Bounded Semi-Naive Bayesian networks, which is shown to outperform NB, CLT, and SNB in our experiments.

In fact, Thiesson et al. [34] have proposed a mixture of general Bayesian networks. However, its performance cannot be expected to be very promising, since its components, unrestricted Bayesian network classifiers, have not been shown to be better than NB.

III. BOUNDED SEMI-NAIVE BAYESIAN NETWORK

Our Bounded Semi-Naive Bayesian network model is defined as follows:

Definition 1 (B-SNB Model): Given a set of N independent observations D = {x^1, ..., x^N} and a bound K, where x^i = (A_1^i, A_2^i, ..., A_n^i) is an n-dimensional vector and A_1, A_2, ..., A_n are called variables or attributes, a B-SNB is a maximum-likelihood Bayesian network satisfying the following conditions:
1) It is composed of m large attributes B_1, B_2, ..., B_m, 1 ≤ m ≤ n, where each large attribute B_l = {A_{l1}, A_{l2}, ..., A_{lk_l}} is a subset of the attribute set {A_1, A_2, ..., A_n}.
2) There is no overlap among the large attributes and their union forms the attribute set. That is,
B_i ∩ B_j = ∅ for i ≠ j, 1 ≤ i, j ≤ m, and B_1 ∪ B_2 ∪ ... ∪ B_m = {A_1, A_2, ..., A_n}.   (2)
3) B_i is independent of B_j for i ≠ j, namely P(B_i, B_j) = P(B_i) P(B_j) for i ≠ j, 1 ≤ i, j ≤ m.
4) The cardinality of each large attribute B_l (1 ≤ l ≤ m) is not greater than K.

If each large attribute has the same cardinality K, we call the B-SNB a K-regular B-SNB. According to the above definition, the distribution encoded by this network, denoted by S, can be written as:

S(A_1, A_2, ..., A_n) = ∏_{j=1}^{m} P(B_j).   (3)

Except for Item 4), the B-SNB definition is the definition of the traditional SNB. We argue that the constraint on the cardinality is necessary: K cannot be set to a very large value, or it will incur an overfitting problem.

It can be verified that when K is set to n, the B-SNB distribution becomes the empirical distribution, which is extraordinarily unreliable as a representation of the dataset, especially when the number of samples is not sufficient. We should stress that the above model is described in a supervised way for classification purposes. When using B-SNB for classification tasks, we first partition the pre-classified dataset into sub-datasets by class label and then train a different B-SNB structure for each class. From this viewpoint, Item 3) is actually a conditional independence statement given the class variable, since the independence is assumed within the sub-dataset with a uniform class label.

A. Learning the Optimal B-SNB from Data

According to the Maximum Likelihood Estimation criterion, the data log likelihood over the best B-SNB for a given dataset should be larger than over any other B-SNB. The data log likelihood over a specific B-SNB S, denoted by l_S, can be written as:

l_S = ∑_{i=1}^{N} log S(x^i).   (4)

In general, the optimal B-SNB for a dataset D can be found in two steps. The first step is to learn an optimal B-SNB structure from D; the second step is to learn the optimal parameters for this structure, where the B-SNB parameters are the probabilities of each large attribute, i.e., P(B_j). It is easy to show that the sample frequency of a large attribute B_j is the maximum-likelihood estimator for the probability P(B_j) when a specific B-SNB structure is given (see the Appendix). Thus the key problem in learning the optimal B-SNB is the structure learning problem, namely how to find the best m large attributes. However, the total number of possible B-SNB structures for an n-dimensional dataset is huge. It can be verified that this quantity is

∑_{{k_1, k_2, ..., k_n} ∈ G} n! / (k_1! k_2! ... k_n!),   where G = {{k_1, k_2, ..., k_n} : k_1 + k_2 + ... + k_n = n, 0 ≤ k_i ≤ K, 1 ≤ i ≤ n}.

Such a huge search space makes it nearly impossible to use greedy methods to find an optimal B-SNB, especially when K has to be set to a small value for reliable probability estimation of the large attributes. This is why the current SNB models [25] [17] have to use heuristic methods to search for the SNB structure. However, as mentioned in Section I, these heuristic methods bring a local nature along with the lower time cost, which is the main obstacle to the mixture upgrading. Different from the local heuristic methods, we introduce another restriction to reduce the search space while maintaining the global nature of the B-SNB solution.

B. Reducing the B-SNB Search Space

To reduce the search space, we first state two lemmas.

Lemma 1: The maximum log likelihood of a specific B-SNB S for a dataset D, denoted by l_S, can be written in the following form:

l_S = -∑_{i=1}^{m} Ĥ(B_i),   (5)

where Ĥ(B_i) is the entropy of the large attribute B_i under the empirical distribution of D. The entropy of a k-large attribute {X_1, X_2, ..., X_k} is defined as:

Ĥ(X_1, X_2, ..., X_k) = -∑_{X_1, ..., X_k} P̂(x_1, ..., x_k) log P̂(x_1, ..., x_k),   (6)

where the lower-case x_i denotes an assignment of a value to the variable X_i, 1 ≤ i ≤ k, and P̂(x_1, ..., x_k) is the empirical probability that the large attribute (X_1, X_2, ..., X_k) takes the value (x_1, x_2, ..., x_k).

Lemma 2: Let µ and µ' be two B-SNBs over a dataset D. If µ is coarser than µ', then µ provides a better approximation than µ' over D. The notion of coarser is defined as follows: if µ can be obtained by combining large attributes of µ' without splitting any large attribute of µ', then µ is coarser than µ'.
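As a hedged illustration of Equations (5) and (6), the sketch below (illustrative names, not the paper's code) computes the empirical entropy of a candidate large attribute and scores a partition by the negative sum of these entropies; a partition that groups strongly dependent attributes gets a higher score.

```python
import math
from collections import Counter

def empirical_entropy(samples, block):
    """H-hat(B) of a large attribute B, given as a tuple of attribute indices (Equation (6))."""
    counts = Counter(tuple(x[i] for i in block) for x in samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def bsnb_score(samples, partition):
    """Lemma 1: the maximized log likelihood of a B-SNB structure is -sum_i H-hat(B_i)."""
    return -sum(empirical_entropy(samples, block) for block in partition)

# Four binary attributes where A_1 always equals A_2 and A_3 always equals A_4:
samples = [(0, 0, 1, 1), (1, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1)]
print(bsnb_score(samples, [(0, 1), (2, 3)]))   # groups the dependent pairs: higher score
print(bsnb_score(samples, [(0, 2), (1, 3)]))   # splits them: lower score
```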

Lemma 1 states that when the B-SNB structure is given, the maximum-likelihood estimator for the parameters is the sample frequency of the large attributes, and the log likelihood is the negative sum of the entropies of the large attributes. Lemma 2 states that a higher-order approximation is superior to a lower-order one given a bound K on the cardinality of the large attributes. Within the bound K used to maintain reliable probability estimation, a higher-order approximation keeps more information about the dataset than a lower-order one. For example, P(a, b, c) P(d, e, f) is more accurate than P(a, b) P(c) P(d, e, f) for approximating P(a, b, c, d, e, f) when each sub-item probability can be estimated reliably. The proofs of Lemma 1 and Lemma 2 are given in the Appendix.

According to Lemma 2, given a bound K that guarantees reliable probability estimation of the large attributes, we should use large attributes with cardinality as high as possible; otherwise it is likely that some large attributes with small cardinalities could be combined into a new large attribute within the bound K, and the new SNB would be coarser than the old one. From this viewpoint, we add the constraint that the cardinality of each large attribute should be exactly the bound K. This constraint is reasonable, since no SNB coarser than a K-regular B-SNB exists when the bound is K. Note that a K-regular B-SNB is not always better than a non-K-regular SNB whose largest cardinality is no more than K, since obviously some non-K-regular SNBs cannot be combined into a K-regular SNB. However, the number of possible B-SNBs after this restriction is about n! / (K!)^[n/K], a much smaller number than the original one; here [x] means rounding x to the nearest integer.

To show how useful the reduction is, we plot in Figure 4 the number of possible B-SNBs (K = 2) in the search space with and without the reduction; the bottom subfigure of Figure 4 also plots the ratio of the two search-space sizes. Two observations can be made from this figure. First, the search space is greatly reduced by our method. Second, although the search space is much reduced, it is still too expensive to perform a greedy search. Our method is to transform the B-SNB combinatorial optimization problem into an Integer Programming (IP) problem and then find its optimal solution; we further approximate the IP solution with the Linear Programming (LP) method, which can be solved at a polynomial computational cost.

Fig. 4. The size of the B-SNB search space when K is equal to 2. (a) The original search space for B-SNB. (b) The search space after the reduction. (c) The ratio of the search space without reduction to that with reduction. The horizontal axes show the number of variables; the vertical axes of the three subfigures are plotted on a log scale.

C. Transforming into an Integer Programming Problem

We first state the B-SNB optimization problem under the Maximum Likelihood Estimation criterion when the cardinality of each large attribute is constrained to be exactly the bound K.

B-SNB Optimization Problem: From the attribute set, find m = [n/K] K-cardinality subsets, which satisfy the B-SNB conditions, to maximize the log likelihood as shown in Equation (5).

We write this B-SNB optimization problem as the following IP problem:

Minimize   ∑_{V_1, V_2, ..., V_K} x_{V_1, V_2, ..., V_K} Ĥ(V_1, V_2, ..., V_K)   (7)

subject to the constraints:

∑_{V_1, V_2, ..., V_{K-1}} x_{V_1, V_2, ..., V_K} = 1   for every attribute V_K,   (8)

x_{V_1, V_2, ..., V_K} ∈ {0, 1},   (9)

where V_1, V_2, ..., V_K denote any K attributes.

Equations (8) and (9) state that every attribute belongs to exactly one large attribute, i.e., once it occurs in one large attribute it must not appear in another, since there is no overlap among the large attributes. We approximate the solution of the IP problem via the Linear Programming (LP) method, which can be solved in polynomial time. By relaxing x_{V_1, V_2, ..., V_K} ∈ {0, 1} to 0 ≤ x_{V_1, V_2, ..., V_K} ≤ 1, the IP problem is transformed into an LP problem, and a rounding procedure is then applied to the LP solution to obtain an integer solution. It should be stressed that solving the IP problem directly is infeasible: it is reported that IP problems with as few as 40 variables can be beyond the abilities of even the most sophisticated computers [35]. Let X denote the set of all possible large attributes {V_1, V_2, ..., V_K}. The rounding scheme is as follows:

Rounding Scheme:
1) Set the maximum x_{V_1, V_2, ..., V_K} over the large attributes in X to 1, record its subscript as a large attribute {V_{M1}, V_{M2}, ..., V_{MK}}, and delete {V_{M1}, V_{M2}, ..., V_{MK}} from X.
2) Set to 0 the coefficients x_{V_1, V_2, ..., V_K} of all large attributes that overlap with {V_{M1}, V_{M2}, ..., V_{MK}}, and delete these large attributes from X.
3) Go to 1) until all the attributes are covered.

Approximating the IP solution by LP may reduce the accuracy of the SNB, but it decreases the computational cost. As shown in [13] on two real-world datasets, the LP solution is a satisfactory approximation of the IP solution.
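The following is a rough sketch of the LP relaxation of Equations (7)-(9) together with the rounding scheme above, using scipy's generic linprog solver. It is not the authors' implementation: it assumes n is divisible by K, and all names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def empirical_entropy(samples, block):
    """Empirical entropy of the attributes indexed by `block` (Equation (6))."""
    counts = Counter(tuple(x[i] for i in block) for x in samples)
    return -sum(c / len(samples) * math.log(c / len(samples)) for c in counts.values())

def bsnb_partition(samples, n, K):
    """Solve the relaxed version of (7)-(9) and round the result to a K-regular partition."""
    blocks = list(combinations(range(n), K))                 # candidate K-cardinality large attributes
    cost = np.array([empirical_entropy(samples, b) for b in blocks])
    A_eq = np.zeros((n, len(blocks)))                        # constraint (8): each attribute covered once
    for j, b in enumerate(blocks):
        for v in b:
            A_eq[v, j] = 1.0
    res = linprog(cost, A_eq=A_eq, b_eq=np.ones(n), bounds=(0, 1))   # (9) relaxed to 0 <= x <= 1
    x = res.x.copy()
    chosen, covered = [], set()
    while len(covered) < n:
        j = int(np.argmax(x))                                # step 1: largest fractional value wins
        chosen.append(blocks[j])
        covered |= set(blocks[j])
        for i, b in enumerate(blocks):                       # step 2: discard every block overlapping it
            if covered & set(b):
                x[i] = -np.inf
    return chosen

samples = [(0, 0, 1, 1), (1, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1)]
print(bsnb_partition(samples, n=4, K=2))                     # expected to recover [(0, 1), (2, 3)]
```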

D. When n/K Is Not an Integer

Problems arise when n cannot be divided exactly by K, i.e., (n mod K) = l ≠ 0. In this case we cannot find a K-regular B-SNB, since one large attribute will always have only l attributes. To solve this problem, we propose two modified versions of the previous algorithm.

The first is to delete the l attributes that are least important for classification and give the classification result based only on the remaining n − l attributes. The least important l-cardinality large attribute is the one with the maximum entropy. A simple example illustrates this deletion. Assume l = 1 and each attribute is a 0-1 variable. The least important attribute is clearly the one with almost the same frequency for 0 and 1, i.e., P(A_i = 0) ≈ 0.5 and P(A_i = 1) ≈ 0.5, since such an attribute is not very helpful for discrimination; and this is exactly the attribute with the maximum entropy. In summary, the first modification is as follows:
1) Assume (n mod K) = l ≠ 0. Among all l-cardinality subsets of the attribute set, select the one with the maximum entropy; denote this l-subset by B^l_max = {A_{max_1}, A_{max_2}, ..., A_{max_l}} (the superscript l in B^l_max indicates that this is an l-large attribute). Let W = {A_1, A_2, ..., A_n} \ B^l_max.
2) Perform the optimization on the attribute set W as shown in Subsection III-C.

Let the resulting K-cardinality large attributes found by this modification be B^K_i, 1 ≤ i ≤ [n/K]. The classification mapping function is then:

c = arg max_{C_j} P(C_j) ∏_{i=1}^{[n/K]} P_{C_j}(B^K_i).   (10)

The second approach first extracts an l-large attribute from the attribute set and then searches for a regular B-SNB among the remaining n − l attributes. Different from the first approach, the final classification decision is still based on all n attributes; and since the l-large attribute is involved in classification, it should be the most important l-cardinality large attribute for classification, namely the one with the minimum entropy. From Lemma 1 we know that to maximize the log likelihood, the entropy of every large attribute should be as small as possible; that is why we first select the l-cardinality large attribute with the minimum entropy among all l-cardinality large attributes. The algorithm is as follows:
1) Assume (n mod K) = l ≠ 0. Among all l-subsets of the attribute set, select the one with the minimum entropy; denote this l-subset by B^l_min = {A_{min_1}, A_{min_2}, ..., A_{min_l}} (the superscript l in B^l_min indicates that this is an l-large attribute). Let W = {A_1, A_2, ..., A_n} \ B^l_min.
2) Perform the optimization on the attribute set W as shown in Subsection III-C.

The final classification mapping function is given by:

c = arg max_{C_j} P(C_j) P_{C_j}(B^l_min) ∏_{i=1}^{[n/K]} P_{C_j}(B^K_i).   (11)

The advantage of the first modification is that, after the l-cardinality large attribute has been deleted, the resulting structure has a global nature over the remaining n − l attributes. This global nature greatly benefits the convergence of the mixture upgrading. In the second approach, by contrast, with the l-large attribute involved in the classification decision rule, the resulting structure including the l-cardinality large attribute has a local property, since we must first choose this l-large attribute. On the other hand, keeping all the attributes in the decision rule preserves the corresponding attribute information, which may help classification. Considering the characteristics of these two approaches, in this paper we use the second approach when constructing a separate B-SNB classifier and the first approach when constructing B-SNBs within the mixture structure.

IV. THE MIXTURE OF BOUNDED SEMI-NAIVE BAYESIAN NETWORKS

As mentioned in the previous sections, the key to the mixture upgrading of SNB networks lies in the algorithm used to optimize the SNB problem. Different from traditional local heuristic SNB approaches, we perform direct combinatorial optimization on the network, which gives the B-SNB a global nature and thus makes it possible to upgrade SNB into a finite mixture structure. In this section, we first define the Mixture of Bounded Semi-Naive Bayesian networks (MBSNB) model, then state the optimization problem of the MBSNB model, and finally give a theoretical derivation of the optimization algorithm for this problem under the EM framework [18].

Definition 2: The Mixture of Bounded Semi-Naive Bayesian networks model is defined as a distribution of the form:

Q(x) = ∑_{k=1}^{r} λ_k S_k(x),   (12)

where λ_k ≥ 0, k = 1, ..., r, ∑_{k=1}^{r} λ_k = 1, and r is the number of components in the mixture structure. S_k represents the distribution of the kth component K-Bounded Semi-Naive network, and λ_k is called the component coefficient.

Optimization Problem of MBSNB: Given a set of N independent observations D = {x^1, x^2, ..., x^N} and a bound K, find the mixture of K-Bounded-SNB model Q that satisfies

Q = arg max_{Q'} ∑_{i=1}^{N} log Q'(x^i).   (13)

We use a derivation similar to [24] to find the solution of the above optimization problem. According to the EM algorithm, finding the solution of (13) is equivalent to maximizing the following complete log-likelihood function:

l_c(x^{1,...,N}, z^{1,...,N} | Q) = ∑_{i=1}^{N} ∑_{k=1}^{r} δ_{k,z^i} log(λ_k S_k(x^i)) = ∑_{i=1}^{N} ∑_{k=1}^{r} δ_{k,z^i} (log λ_k + log S_k(x^i)),   (14)

where z is the choice variable, which can be seen as the hidden variable determining the choice of the component Semi-Naive structure; δ_{k,z^i} is equal to 1 when z^i is equal to the kth value of the choice variable and 0 otherwise. We use the EM algorithm to find the solution of Equation (14). First, taking the expectation of Equation (14) with respect to z, we obtain

E[l_c(x^{1,...,N}, z^{1,...,N} | Q)] = ∑_{i=1}^{N} ∑_{k=1}^{r} E(δ_{k,z^i} | D) (log λ_k + log S_k(x^i)),   (15)

where E(δ_{k,z^i} | D) is the posterior probability given the ith observation, which can be calculated as:

E(δ_{k,z^i} | D) = P(z^i | V = x^i) = λ_k S_k(x^i) / ∑_{k'} λ_{k'} S_{k'}(x^i).   (16)

We define

γ_k(i) = E(δ_{k,z^i} | D),   Γ_k = ∑_{i=1}^{N} γ_k(i),   P_k(x^i) = γ_k(i) / Γ_k.

Thus we obtain the expectation:

E[l_c(x^{1,...,N}, z^{1,...,N} | Q)] = ∑_{k=1}^{r} Γ_k log λ_k + ∑_{k=1}^{r} Γ_k ∑_{i=1}^{N} P_k(x^i) log S_k(x^i).   (17)

Then we perform the maximization step on (17) with respect to the parameters. The first part of (17) is easily maximized by the Lagrange method under the constraint ∑_{k=1}^{r} λ_k = 1, giving:

λ_k = Γ_k / N,   k = 1, ..., r.   (18)

If we regard P_k(x^i) as the probability of each observation under the kth component B-SNB, the second part of Equation (17) is in fact a B-SNB network optimization problem, which can be solved by the algorithm proposed in Section III. It can also be seen here that the B-SNB solution must have a global nature; otherwise it is quite possible that this maximization will not exceed the maximum value of the previous M step, and the EM algorithm will not converge to a fixed point. Our B-SNB solution, based on direct combinatorial optimization, gives the resulting B-SNB structure a global nature, even though the LP approximation reduces this optimality slightly. Therefore the finite mixture upgrading based on our B-SNB model has good convergence performance; we demonstrate this in the experiments as well. The optimization process is shown as Algorithm 1.

V. COMPUTATIONAL COMPLEXITY ANALYSIS

In this section, we conduct a simple computational complexity analysis, first for B-SNB and then for MBSNB. Strong empirical evidence shows that classical LP optimization methods such as the simplex method take only O(w) iterations to find an optimal solution with w equality constraints [1], and each iteration costs O(wN) arithmetic operations, where N is the number of variables to be solved. For our LP problem of Equation (7) in B-SNB, there are in total N = C_n^K variables x_{V_1, V_2, ..., V_K} to be solved and w is equal to n. Accordingly, the computational cost of our B-SNB optimization process is about n^2 C_n^K. On the other hand, q^K C_n^K operations are needed to calculate the K-variable entropies in Equation (7), where q is the maximum number of values a variable can take. Hence the total cost of finding the optimal B-SNB is (n^2 + q^K) C_n^K, which is an O(n^{K+2}) time cost when K ≪ n.

Algorithm 1: Mixture of Semi-Naive Bayesian networks, Mix-Semi(D, Q^0)

    input: dataset D = {x^1, x^2, ..., x^N}, initial model Q^0 = {r, S_k, λ_k, k = 1, ..., r}
    repeat
        E step: compute γ_k(i) and P_k(x^i) for k = 1, ..., r, i = 1, ..., N
        M step:
            for k = 1, ..., r do
                λ_k ← Γ_k / N
                S_k ← B-SNB(P_k)
            end for
    until convergence
    output: model Q = {r, S_k, λ_k, k = 1, ..., r}

In the traditional SNB [17], by contrast, the computational cost is exponential: the number of iterations over the training dataset is approximately equal to the number of value combinations of all attributes. For a simple example in which every variable has q values, the combination cost is q^n, an exponential cost. As the variable dimension grows, the cost difference between B-SNB and Kononenko's SNB becomes larger and larger. On the other hand, the approach of Pazzani [25] is impractical for combining even three attributes, although it has a reported cost of O(n^3) when combining two attributes: as noted in [25], although it would be possible to consider joining three or more attributes, the computational complexity makes it impractical for most databases. Thus the accuracy of Pazzani's SNB may be limited in this sense. Table I summarizes the above analysis; here "Max" denotes the maximum number of variables that can be involved in a large attribute. The total computational cost of the mixture of B-SNBs, MBSNB, per EM iteration is thus O(r n^{K+2}).
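A rough Python sketch of the EM loop in Algorithm 1 is given below. It is not the authors' code: fit_component stands for any global B-SNB learner that accepts per-sample weights (for example, the LP-based procedure above with entropies computed under the weighted empirical distribution P_k), and component_pdf evaluates Equation (3) for a fitted component; both are assumptions of this sketch.

```python
import numpy as np

def em_mbsnb(samples, r, fit_component, component_pdf, n_iter=50, seed=0):
    """EM for a mixture of B-SNBs (sketch of Algorithm 1).
    fit_component(samples, weights) -> component model (assumed global B-SNB learner)
    component_pdf(model, x)         -> S_k(x), the component density of Equation (3)
    """
    rng = np.random.default_rng(seed)
    N = len(samples)
    lam = np.full(r, 1.0 / r)                                # component coefficients lambda_k
    resp = rng.dirichlet(np.ones(r), size=N)                 # random initial responsibilities
    models = [fit_component(samples, resp[:, k]) for k in range(r)]
    for _ in range(n_iter):
        # E step, Equation (16): gamma_k(i) proportional to lambda_k * S_k(x_i)
        dens = np.array([[component_pdf(m, x) for m in models] for x in samples])
        resp = lam * dens
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        # M step, Equation (18): lambda_k = Gamma_k / N, then refit each component on weighted data
        gamma = resp.sum(axis=0)
        lam = gamma / N
        models = [fit_component(samples, resp[:, k]) for k in range(r)]
    return lam, models
```

In practice one would also monitor the log likelihood across iterations and stop once it no longer increases, which is the convergence behaviour examined in Subsection VI-D.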

TABLE I
COMPUTATION COST TABLE

Method   B-SNB        Kononenko   Pazzani
Cost     O(n^{K+2})   O(q^n)      O(n^3)
Max      K            n - 1       2

VI. EXPERIMENTS

In this section, we first propose a set of pre-processing methods that handle some important issues in implementing our models. We then describe our experimental setup in Subsection VI-B. In Subsection VI-C, we demonstrate that our models are superior to other approaches in prediction accuracy. Finally, in Subsection VI-D, we show that our mixture model has good convergence performance.

A. Pre-Processing Methods

To implement our algorithms, four main issues must be handled: numeric attributes, zero counts, missing values and parameter selection. We deal with these issues as follows.

Numeric attributes are discretized into discrete attributes, since our algorithms can only handle discrete attributes. We discretize each numeric attribute into five equal intervals. Although this approach is slightly less accurate than a more informed one [7], it is sufficient for evaluating the performance of the main approaches in this paper.

Zero counts occur when a given class and attribute value never co-occur in the training dataset. This may cause problems in estimating the frequency-based probabilities in the classification mapping function. For example, in NB's classification mapping function, if some value of an attribute A_k never occurs, the estimated P(A_k | C) is zero, and consequently the joint probability on the right-hand side of Equation (1) is 0 regardless of the other terms P(A_i | C). This is an especially serious problem when implementing the B-SNB and MBSNB algorithms, since it is more likely that some configuration of a large attribute never occurs.

This happens in two situations: the first is that some configuration of a large attribute never occurs; the second is that some value of a given attribute never occurs. To tackle the first issue, once we find such a configuration of a given large attribute, we degrade the probability calculation for this large attribute to the product of its attributes' probabilities. For example, let B = {A_{l1}, A_{l2}, ..., A_{lk}} be a large attribute whose configuration {A_{l1} = 0, A_{l2} = 1, ..., A_{lk} = 1} never occurs; we then replace the probability P(A_{l1}, A_{l2}, ..., A_{lk} | C) with ∏_{i=1}^{k} P(A_{li} | C). In this way, on sparse databases the B-SNB can at least maintain the performance of NB, since it degrades towards the NB classifier when many configurations of large attributes are absent. To tackle the second issue, we use the popular Laplace correction [27]. The corrected empirical estimate of P(A_j = a_jk | C_i) is (n_ijk + f) / (n_i + f n_j) instead of the uncorrected n_ijk / n_i, where a_jk is a value of attribute A_j, n_ijk is the number of times class C_i and the value a_jk of attribute A_j occur together, n_i is the number of observations with class label C_i, and n_j is the number of values of attribute A_j. We take the same value f = 1/N as [6] [16], where N is the number of samples in the training database. The correction for a large attribute is similar.

Missing values are simply treated as another discrete value of the corresponding attribute. This is reasonable for some datasets, where missing values are those that cannot be determined to be any of the values of the specific attribute.

Parameter selection must be done for the cardinality K of the large attributes in the B-SNB and MBSNB approaches. K is usually given a small value, mainly because the empirical probability estimate of a large attribute is unreliable when its cardinality is too big; as mentioned above, a large attribute with large cardinality is more likely to encounter zero-count problems. In our experiments, we set K to 2 or 3.
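As a small worked example of the Laplace correction described above (hypothetical counts; f = 1/N as in the text):

```python
def laplace_corrected(n_ijk, n_i, n_j, N):
    """Corrected estimate of P(A_j = a_jk | C_i): (n_ijk + f) / (n_i + f * n_j), with f = 1/N.
    n_ijk: co-occurrences of class C_i and value a_jk; n_i: samples of class C_i;
    n_j: number of distinct values of attribute A_j; N: training-set size."""
    f = 1.0 / N
    return (n_ijk + f) / (n_i + f * n_j)

# A value never seen together with class C_i still gets a small nonzero probability:
print(laplace_corrected(n_ijk=0, n_i=30, n_j=3, N=90))   # about 0.00037 instead of 0
```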

B. Experimental Setup

1) Datasets: To evaluate the performance of our B-SNB and MBSNB models, we conduct a series of experiments on 7 datasets, of which 6 come from the UCI Machine Learning Repository [26] and the other, called Xor, is generated synthetically: its class variable is determined by the first two binary attributes, and four further binary attributes are created randomly. Table II gives short descriptions of the datasets used in this paper; detailed information about them can be found in [26]. To examine the performance of our approaches, we use 5-fold cross validation (CV) [15] for the small and medium-sized datasets.

TABLE II
DESCRIPTION OF DATA SETS USED IN THE EXPERIMENTS
Datasets: Xor, Vote, Tic-tac-toe, Vehicle, Segment, Post and Iris. All except Segment are evaluated with 5-fold cross validation; Segment uses a percentage train/test split.

2) Experimental Environment: Platform: Windows 2000; software: Matlab 6.1; hardware: 1.4 GHz Pentium III processor, 512 MB RAM.

C. Prediction Accuracy

We train an MBSNB model Q_{C_i} for each class C_i of every dataset and use the Bayes formula

c(x) = arg max_{C_i} P(C_i) Q_{C_i}(x)   (19)

to classify a new instance x.
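To make Equations (12) and (19) concrete, a minimal sketch of the classification step is shown below (illustrative names; component_pdf is the same assumed helper as in the EM sketch):

```python
def mixture_pdf(lam, models, component_pdf, x):
    """Equation (12): Q(x) = sum_k lambda_k * S_k(x)."""
    return sum(l * component_pdf(m, x) for l, m in zip(lam, models))

def classify_mbsnb(x, class_priors, class_mixtures, component_pdf):
    """Equation (19): c(x) = argmax_Ci P(Ci) * Q_Ci(x), with one MBSNB (lam, models) per class."""
    return max(class_priors, key=lambda c: class_priors[c] *
               mixture_pdf(*class_mixtures[c], component_pdf, x))
```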

Our initial experiments show that our B-SNB model outperforms the Naive Bayesian classifier; the comparison is shown in Figure 5. The B-SNB model can be further improved by upgrading it into the mixture model, as shown in Figure 6. To evaluate our mixture model's performance, we also compare it with two other competitive methods, CLT and C4.5: the experiments show that the mixture of B-SNBs performs better than CLT, as shown in Figure 7, and outperforms C4.5 slightly, as shown in Figure 8.

Fig. 5. Scatter plot comparing NB and B-SNB. Points below the diagonal line correspond to datasets where B-SNB performs better, and points above the diagonal line correspond to datasets where NB performs better. 2-BSNB means that K, the cardinality of the large attributes, is set to 2, and similarly for 3-BSNB.

Table III summarizes the prediction results of the main approaches in this paper, and we also calculate the average accuracy of each approach, shown in Figure 9. Although MBSNB does not necessarily give the best prediction performance on every dataset, its overall performance is the best, as shown in Figures 6, 7 and 8. It is interesting to notice that on the Post dataset, 2-BSNB performs significantly better than its 2-MBSNB upgrading. This may be due to two possible reasons. One is the dataset's sparsity: there are only 90 samples for 8 attributes and 3 classes, which makes the probability estimation of large attributes, even 2-cardinality ones, unreliable and may affect the result of the mixture models. The other possible reason is that 2-BSNB is a special case of 2-MBSNB, and the performance decrease may imply that the dataset is not a multi-modality one.

Fig. 6. Scatter plot comparing B-SNB and MBSNB. Points below the diagonal line correspond to datasets where MBSNB performs better, and points above the diagonal line correspond to datasets where B-SNB performs better. 2-BSNB and 2-MBSNB mean that K, the cardinality of the large attributes, is set to 2, and similarly for 3-BSNB and 3-MBSNB.

This problem is actually the problem of how to select the number of mixture components; a further discussion is given in Section VII.

TABLE III
PREDICTION ACCURACY OF THE PRIMARY APPROACHES IN THIS PAPER (%)
Accuracy of NB, CLT, 2-BSNB, 3-BSNB, C4.5, 2-MBSNB and 3-MBSNB on each dataset (Xor, Tic-tac-toe, Vote, Vehicle, Segment, Post, Iris) and on average.

Fig. 7. Scatter plot comparing CLT and MBSNB. Points below the diagonal line correspond to datasets where MBSNB performs better, and points above the diagonal line correspond to datasets where CLT performs better. 2-MBSNB means that K, the cardinality of the large attributes, is set to 2, and similarly for 3-MBSNB.

D. Convergence Performance

In this subsection, we examine the convergence performance of our mixture model. As discussed in the previous sections, the B-SNB optimization method greatly influences the mixture model's convergence. Figures 10 and 11 show the convergence curves for 4 of the datasets used in our experiments. Except for the Vote dataset, the other three datasets show good convergence curves. The zigzag for Vote is caused by the LP approximation of the IP solution, which slightly reduces the global optimality of the final B-SNB model. However, even though some zigzags occur, the overall trend of the curve for Vote is towards a convergence point. This implies that the approximation method is successful and that our B-SNB model maintains a good global nature.

VII. DISCUSSION

There are two main issues to discuss here, which are also topics of our future work. The first is the IP approximation based on the LP technique; the second is the choice of the component number for the mixture structure. The LP approximation method for finding the IP solution may reduce the optimality of the final B-SNB structure.

Fig. 8. Scatter plot comparing C4.5 and MBSNB. Points below the diagonal line correspond to datasets where MBSNB performs better, and points above the diagonal line correspond to datasets where C4.5 performs better. 2-MBSNB means that K, the cardinality of the large attributes, is set to 2, and similarly for 3-MBSNB.

Fig. 9. Average error of the main approaches in this paper.

However, compared with the infeasibility of solving the IP directly, this approximation gives us a polynomial time cost. Although a rigorous analysis of the approximation quality of the LP is not available for the time being, our experiments show that the approximation maintains the convergence of the finite mixture upgrading, which implies a good approximation level for our approach. Although in this paper we mainly discuss a finite mixture of B-SNBs, this does not mean that choosing the number of components is unimportant; it is an open problem for mixture models on which many researchers are currently working [9] [10] [32] [33].

Fig. 10. Convergence performance (normalized log likelihood against iteration, one curve per class). (a) Tic-tac-toe. (b) Vote.

Fig. 11. Convergence performance (normalized log likelihood against iteration, one curve per class). (a) Vehicle. (b) Segment.

In this paper, we set the number of components based on some intuitive considerations. For databases with more attributes and a large number of training samples, such as the Tic-tac-toe, Vote, Vehicle and Segment datasets, we simply set the component number to 10; for databases with a small number of attributes or a small number of training samples, such as the Xor, Post and Iris datasets, we set the component number to a smaller value, 5. This is partly out of consideration of resistance to overfitting: for instance, in a small database a large number of components is more likely to cause overfitting. Obviously, the number of components is one of the factors that influence MBSNB's performance. As mentioned in Subsection VI-C, the B-SNB model is the special case of MBSNB with the component number equal to 1.

If a dataset is single-modality, a mixture model will not be a suitable model for it. How to select the component number is part of our future work.

VIII. CONCLUSION

The Semi-Naive Bayesian network classifier, one of the restricted probabilistic models, performs well in expanding the Naive Bayesian classifier, which is itself competitive with state-of-the-art classifiers such as C4.5. Mixture models have demonstrated success in representing accurate distributions in real applications, so it is promising to upgrade the Semi-Naive Bayesian network into a mixture model. However, because of its extraordinarily large search space, the traditional Semi-Naive Bayesian network has to use local heuristic methods to learn its structure from data. This local property prevents the traditional methods from being used in the mixture upgrading, since it does not guarantee that the value of the optimization function is greater than its value in the previous step, and thus does not guarantee the convergence of the EM process. In this paper, we break through this bottleneck for the mixture upgrading of the Semi-Naive Bayesian network. We propose a Bounded Semi-Naive Bayesian network and transform the optimization problem for the Semi-Naive Bayesian network into an Integer Programming problem. Our Semi-Naive Bayesian model is shown to enjoy a global nature and maintain a polynomial time cost. We then upgrade it into a finite mixture model. To the best of our knowledge, this is the first mixture model for Semi-Naive Bayesian networks. Our experimental results show that this mixture model has good convergence performance and really brings an increase in prediction accuracy as well.

IX. APPENDIX

Proof of Lemma 1: Let S be a specific B-SNB with n variables or attributes, represented by A_i, 1 ≤ i ≤ n, and large attributes B_i, 1 ≤ i ≤ m; we use (B_1, ..., B_m) as the short form of (B_1, B_2, ..., B_{m-1}, B_m). The log likelihood over a dataset of s samples can be written as follows:

l_S(x^1, x^2, ..., x^s) = ∑_{j=1}^{s} log P(x^j) = ∑_{j=1}^{s} log ∏_{i=1}^{m} P(B_i = b_i^j) = ∑_{i=1}^{m} ∑_{B_i} s P̂(B_i) log P(B_i),   (20)

where b_i^j denotes the value taken by the large attribute B_i in the sample x^j. The above term is maximized when P(B_i) is estimated by P̂(B_i), the empirical probability of the large attribute B_i; this is easily obtained by maximizing l_S with respect to P(B_i). Thus,

l_{S max} = s ∑_{i=1}^{m} ∑_{B_i} P̂(B_i) log P̂(B_i) = -s ∑_{i=1}^{m} Ĥ(B_i),

which, up to the positive constant s, is the form stated in Lemma 1.

Proof of Lemma 2: We consider only a simple case; the proof of the general case is similar. Consider one partition µ = (B_1, B_2, ..., B_m) and another partition µ_1 = (B_1, B_2, ..., B_{m-1}, B_{m1}, B_{m2}), where B_{m1} ∩ B_{m2} = ∅ and B_{m1} ∪ B_{m2} = B_m. According to the proof of Lemma 1 above, we have:

l_{S_µ max} = -∑_{i=1}^{m} Ĥ(B_i)


More information

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification 10-810: Advanced Algorithms and Models for Computational Biology Optimal leaf ordering and classification Hierarchical clustering As we mentioned, its one of the most popular methods for clustering gene

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information

Week Cuts, Branch & Bound, and Lagrangean Relaxation

Week Cuts, Branch & Bound, and Lagrangean Relaxation Week 11 1 Integer Linear Programming This week we will discuss solution methods for solving integer linear programming problems. I will skip the part on complexity theory, Section 11.8, although this is

More information

Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013

Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013 Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative Chain CRF General

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 4 Learning Bayesian Networks CS/CNS/EE 155 Andreas Krause Announcements Another TA: Hongchao Zhou Please fill out the questionnaire about recitations Homework 1 out.

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

Bayesian Network Classifiers *

Bayesian Network Classifiers * Machine Learning, 29, 131 163 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Bayesian Network Classifiers * NIR FRIEDMAN Computer Science Division, 387 Soda Hall, University

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Bayesian Networks Structure Learning (cont.)

Bayesian Networks Structure Learning (cont.) Koller & Friedman Chapters (handed out): Chapter 11 (short) Chapter 1: 1.1, 1., 1.3 (covered in the beginning of semester) 1.4 (Learning parameters for BNs) Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Classification Based on Logical Concept Analysis

Classification Based on Logical Concept Analysis Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer

More information

Mixtures of Gaussians. Sargur Srihari

Mixtures of Gaussians. Sargur Srihari Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Not so naive Bayesian classification

Not so naive Bayesian classification Not so naive Bayesian classification Geoff Webb Monash University, Melbourne, Australia http://www.csse.monash.edu.au/ webb Not so naive Bayesian classification p. 1/2 Overview Probability estimation provides

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes

Pattern Recognition and Machine Learning. Learning and Evaluation of Pattern Recognition Processes Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lesson 1 5 October 2016 Learning and Evaluation of Pattern Recognition Processes Outline Notation...2 1. The

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Aijun An and Nick Cercone. Department of Computer Science, University of Waterloo. methods in a context of learning classication rules.

Aijun An and Nick Cercone. Department of Computer Science, University of Waterloo. methods in a context of learning classication rules. Discretization of Continuous Attributes for Learning Classication Rules Aijun An and Nick Cercone Department of Computer Science, University of Waterloo Waterloo, Ontario N2L 3G1 Canada Abstract. We present

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori

More information

IE418 Integer Programming

IE418 Integer Programming IE418: Integer Programming Department of Industrial and Systems Engineering Lehigh University 2nd February 2005 Boring Stuff Extra Linux Class: 8AM 11AM, Wednesday February 9. Room??? Accounts and Passwords

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 9: Expectation Maximiation (EM) Algorithm, Learning in Undirected Graphical Models Some figures courtesy

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Slides adapted from Jordan Boyd-Graber Machine Learning: Chenhao Tan Boulder 1 of 39 Recap Supervised learning Previously: KNN, naïve

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks

Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks Yoshua Bengio Dept. IRO Université de Montréal Montreal, Qc, Canada, H3C 3J7 bengioy@iro.umontreal.ca Samy Bengio IDIAP CP 592,

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Algorithms for Classification: The Basic Methods

Algorithms for Classification: The Basic Methods Algorithms for Classification: The Basic Methods Outline Simplicity first: 1R Naïve Bayes 2 Classification Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology SE lecture revision 2013 Outline 1. Bayesian classification

More information

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1 Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating

More information

Introduction to SVM and RVM

Introduction to SVM and RVM Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB Overview Support vector machine SVM First introduced by Vapnik, et al. 1992 Several literature and wide applications Relevance

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Crowdsourcing via Tensor Augmentation and Completion (TAC)

Crowdsourcing via Tensor Augmentation and Completion (TAC) Crowdsourcing via Tensor Augmentation and Completion (TAC) Presenter: Yao Zhou joint work with: Dr. Jingrui He - 1 - Roadmap Background Related work Crowdsourcing based on TAC Experimental results Conclusion

More information

Machine Learning for Signal Processing Bayes Classification

Machine Learning for Signal Processing Bayes Classification Machine Learning for Signal Processing Bayes Classification Class 16. 24 Oct 2017 Instructor: Bhiksha Raj - Abelino Jimenez 11755/18797 1 Recap: KNN A very effective and simple way of performing classification

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups

Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Contemporary Mathematics Pattern Recognition Approaches to Solving Combinatorial Problems in Free Groups Robert M. Haralick, Alex D. Miasnikov, and Alexei G. Myasnikov Abstract. We review some basic methodologies

More information