Optimization of Classifier Chains via Conditional Likelihood Maximization


Lu Sun, Mineichi Kudo
Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan
Corresponding author addresses: sunlu@main.ist.hokudai.ac.jp (Lu Sun), mine@main.ist.hokudai.ac.jp (Mineichi Kudo)

Abstract

Multi-label classification associates an unseen instance with multiple relevant labels. In recent years, a variety of methods have been proposed to handle multi-label problems. Classifier chains is one of the most popular multi-label methods because of its efficiency and simplicity. In this paper, we consider optimizing classifier chains from the viewpoint of conditional likelihood maximization. In the proposed unified framework, classifier chains can be optimized in either or both of two aspects: label correlation modeling and multi-label feature selection. We show that previous classifier chains algorithms are special cases of the unified framework, and that previous information theoretic multi-label feature selection algorithms are also obtained from it under different assumptions on the feature and label spaces. Based on these analyses, we propose a novel multi-label method, k-dependence classifier chains with label-specific features, and demonstrate its effectiveness.

Keywords: multi-label classification, classifier chains, conditional likelihood maximization, k-dependence Bayesian network, multi-label feature selection.

1. Introduction

Unlike traditional single-label classification, where each instance is associated with only one label, Multi-Label Classification (MLC) refers to problems that assign multiple labels to a single test instance. MLC can be seen in a wide range of real-world applications such as text categorization, semantic image classification, bioinformatics analysis and video annotation. In fact, MLC is ubiquitous in real-world problems.

For example, a news article is possibly relevant to multiple topics, like science, technology, economics and politics, and a single image is probably associated with a set of semantic concepts, like sky, sea, field and building. To tackle such multi-label problems, various MLC methods have been proposed. The existing MLC methods fall into two broad categories: problem transformation and algorithm adaptation [1]. As a convenient and straightforward way to perform MLC, the problem transformation strategy transforms an MLC problem into one or a set of single-label classification problems, and learns one or a family of classifiers for modeling the single-label memberships. Most popular baseline MLC methods, such as Binary Relevance (BR) [2], Calibrated Label Ranking (CLR) [3], and Label Powerset (LP) [4], belong to this strategy. The algorithm adaptation strategy adapts conventional machine learning algorithms to the multi-label setting. Various MLC methods adopting one of the above two strategies have been developed and have succeeded in dealing with multi-label problems.

Classifier Chains (CC) [5] is one of the most promising MLC methods following the problem transformation strategy. Originating from BR, which simply ignores label correlations, CC constructs a chain structure on the labels and determines the presence/absence of the current label conditioned on the previously determined labels. CC succeeds in modeling label correlations and achieves higher classification accuracy at a computational expense similar to that of BR. Although CC-based methods have achieved much success in various applications [5, 6, 7], further improvement in classification accuracy is still required. Here we seek to improve CC in two aspects: label correlation modeling and multi-label feature selection. The intuition behind this idea is that not all of the previously determined labels are necessary for the decision on the current label (the necessity of limiting label correlations), and that irrelevant and redundant features are usually harmful to the performance of CC (the necessity of feature selection).

In this paper, we propose a unified framework comprising both label correlation modeling and multi-label feature selection via conditional likelihood maximization for MLC. The contributions of this work are three-fold. First, we propose a general framework taking label correlation modeling and multi-label feature selection into account via conditional likelihood maximization. Second, the k-dependence classifier chains method is proposed based on greedy iterative optimization of a sub-problem of likelihood maximization.

Third, a general information theoretic feature selection method is proposed for MLC, where three terms on relevancy, redundancy and label correlations are considered for feature subset selection.

The rest of this paper is organized as follows. Section 2 discusses related work, mainly focusing on CC-based methods and information theoretic feature selection. Section 3 illustrates the unified framework for MLC via conditional likelihood maximization, and derives two sub-problems: model selection and multi-label feature selection. Sections 4 and 5 present the solutions for model selection and multi-label feature selection, respectively. Section 6 summarizes several theoretical findings made during the development of the proposed method. Section 7 discusses implementation issues. Section 8 introduces the experiments and reports the results. Finally, Section 9 concludes this paper and discusses further research.

2. Related works

Previous efforts on MLC have addressed various viewpoints, such as label correlation modeling [4, 5], loss function analysis [6, 8, 9], large-scale learning [10, 11] and dimension reduction [12, 13, 14]. In this paper we concentrate mainly on two aspects: label correlation modeling and dimensionality reduction. It has been shown in a number of studies [3, 4, 5] that modeling label correlations is crucial for accurate classification. On the other hand, various dimension reduction algorithms, including feature selection [15, 16] and feature extraction [12, 17], have been employed in MLC in order to simplify the learning phase and overcome the curse of dimensionality.

In order to capture label correlations, Classifier Chains (CC) based methods [5, 6, 7] have been proposed with tractable computational complexity. CC-based methods originate from Binary Relevance (BR) [2], which simply decomposes a multi-label problem into a set of binary classification problems, totally ignoring label correlations. In this sense, BR is actually a Hamming loss risk minimizer [8]. In CC [5], label correlations are expressed in an ordered chain of labels. In the learning phase, according to a predefined chain order, it builds a set of binary classifiers such that each classifier predicts the correct value of a target label by referring to the correct values of all the preceding labels in addition to the features.

In the prediction phase, it predicts in turn the value of the target label using the previously estimated values of its parent labels as extra features. However, the performance of CC is sensitive to the chain order, and it suffers from error propagation in the prediction phase. Several efforts have been made to overcome the limitations of CC. Bayesian Classifier Chains (BCC) [7] introduces a directed tree as the probabilistic structure over labels. The directed tree is established by randomly choosing a label as its root and by assigning directions to the remaining edges. It shares the same model with CC, but restricts the number of parent labels to at most one, which limits its ability to express label correlations. Probabilistic Classifier Chains (PCC) [6] aims to solve the error propagation problem, providing better estimates than CC at the expense of higher processing time. Although PCC shares its learning model with CC, it chooses the best prediction by searching for the Maximum A Posteriori (MAP) assignment in an exhaustive manner. The exponential cost of PCC in prediction limits its application. To make prediction tractable for PCC, PCC-beam [18] applies beam search to find an approximate MAP assignment of labels for a test instance. MCC [19] utilizes a Monte Carlo scheme to find a suboptimal chain order and to perform efficient inference of the MAP assignment in the learning and prediction phases, respectively. In [20], dynamic programming is used to search for the globally optimal chain order of CC; in addition, to speed up the search procedure, a greedy approach is proposed to find a locally optimal CC. In a recent work [21], the Classifier Trellis (CT) method is proposed for scalable MLC by extending the one-dimensional chain of CC to a two-dimensional trellis structure. CT encodes label correlations in the trellis structure, where each label depends only on its adjacent labels. In this way, CT limits the number of parent labels, and thus becomes scalable to MLC problems with a large number of labels.

In terms of Feature Space Dimensionality Reduction (FS-DR), a variety of traditional supervised dimension reduction approaches have been specifically extended to the setting of MLC. In [22], a supervised Multi-label Latent Semantic Indexing (MLSI) approach is developed to map the input features into a subspace while preserving the label information. By maximizing the feature-label dependence under the Hilbert-Schmidt independence criterion, Multi-label Dimensionality reduction via Dependence Maximization (MDDM) [12] derives a closed-form solution to efficiently find the projection into the feature subspace.

In addition, several traditional dimension reduction techniques, such as Canonical Correlation Analysis (CCA) and Linear Discriminant Analysis (LDA), have been extended to handle the MLC problem [23, 24]. In the field of feature selection, an information theoretic approach for Label Powerset (LP) has been developed in [25]. The method introduces a nearest neighbor estimator for computing mutual information, and applies pruned LP to control the problem size. The multivariate mutual information criterion is used in [26] to select useful features; due to its computational inefficiency, an approximate solution is proposed to estimate the multivariate mutual information. In [15], the authors extend the feature selection framework in [27] to handle two MLC decomposition methods, BR and LP, which achieves significant improvement compared with several multi-label feature selection methods. In fact, all the methods mentioned above can be regarded as global FS-DR methods, since they attempt to find a single feature subspace shared by all the labels. However, it is more reasonable to think that each label holds a specific supporting feature subset. To overcome this limitation, local FS-DR methods [17, 16] have been proposed to find label-specific features. In [17], Label-specIfic FeaTures (LIFT) are extracted by conducting cluster analysis on the positive and negative instances of each label. The Learning Label-Specific Features (LLSF) method is proposed in [16]; LLSF selects label-specific features by optimizing a least squares problem with constraints on label correlations and feature sparsity.

3. The unified framework for MLC via likelihood maximization

3.1. Multi-label classification

In the scenario of MLC, an observation (x, y) consists of a d-dimensional feature vector x and a q-dimensional target label vector y, drawn from the underlying random variables $X = (X_1, \ldots, X_d) \in \mathbb{R}^d$ and $Y = (Y_1, \ldots, Y_q) \in \{0,1\}^q$, respectively. For an observation of labels $y = (y_1, \ldots, y_q)$, $y_j = 1$ if label j is relevant to the instance, and $y_j = 0$ otherwise, $j = 1, \ldots, q$. The task of MLC is to find an optimal classifier $h: \mathbb{R}^d \to \{0,1\}^q$, which assigns a label vector $\hat{y} = h(x)$ to each instance x such that h minimizes a loss function between $\hat{y}$ and y. For a loss function $L(Y, h(X))$, the optimal classifier $h^*$ is

$$h^* = \arg\min_h E_{xy}\, L(Y, h(X)). \quad (1)$$

Specifically, given the subset 0-1 loss $L_s(y, \hat{y}) = 1_{[y \neq \hat{y}]}$, where $1_{[\cdot]}$ denotes the indicator function, Eq. (1) can be rewritten in a pointwise way,

$$\hat{y} = h^*(x) = \arg\max_y p(y|x). \quad (2)$$

Here we use p(Y|X) to represent the conditional probability distribution of the label variables Y given the feature variables X. According to (2), the following two capabilities are necessary for a multi-label classifier h optimized for subset 0-1 loss. In the training phase, it should capture the underlying probability distribution p(Y|X); in the testing phase, given a test instance $\hat{x}$, the Maximum A Posteriori (MAP) assignment $\hat{y}$ should be obtained by solving the optimization problem $\max_y p(y|\hat{x})$. In this study, we focus on training a multi-label classifier with the first capability, i.e., successfully modeling p(Y|X), in order to learn an approximately optimal classifier in subset 0-1 loss.

3.2. A conditional likelihood view on MLC

According to (2), the objective of MLC is actually to approximate the underlying conditional probability p(Y|X). In this paper, we assume that p(Y|X) can be approximated by a Bayesian network B of relatively simple structure with conditional probability p_B(Y|X), which optimally captures the label correlations. In addition, we further assume that p_B(Y|X) can be modeled by a subset of X, i.e., the relevant features $X_\theta \subseteq X$. A d-dimensional binary vector θ is adopted, with 1 indicating a selected feature and 0 an unselected one. Hence, $X_\theta$ and $X_{\bar\theta}$ denote the selected and unselected feature subsets, respectively, and thus $X = \{X_\theta, X_{\bar\theta}\}$ and $p_B(Y|X) = p_B(Y|X_\theta)$. Last, a predictive model $f(Y|X_\theta, \tau)$ is built to model $p_B(Y|X_\theta)$, where τ denotes the parameters of f used to predict y. In this way, we find the optimal Bayesian network B, identify the relevant feature subset $X_\theta$, and then build the predictive model f.

Given a sample of N observations $D = \{(x^i, y^i)\,|\, i = 1, \ldots, N\}$ drawn from an underlying i.i.d. process $p: X \to Y$, the conditional likelihood L of the observations D given the parameters {B, θ, τ} becomes

$$L(B, \theta, \tau \,|\, D) = \prod_{i=1}^{N} f(y^i \,|\, x^i_\theta, \tau). \quad (3)$$

Therefore, our objective becomes

$$\{B^*, \theta^*, \tau^*\} = \arg\max_{B, \theta, \tau} L(B, \theta, \tau \,|\, D) = \arg\max_{B, \theta, \tau} \prod_{i=1}^{N} f(y^i \,|\, x^i_\theta, \tau). \quad (4)$$
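Before decomposing (4), it may help to see numerically why (2) targets the joint MAP assignment rather than per-label marginal modes. The following is a minimal Python sketch on a made-up distribution p(y|x) over q = 2 labels; all numbers are hypothetical and only serve to illustrate the point, not taken from the paper.

```python
import itertools

# Hypothetical conditional distribution p(y | x) for one fixed instance x
# over q = 2 labels (made-up numbers).
p = {(0, 0): 0.00, (0, 1): 0.35, (1, 0): 0.40, (1, 1): 0.25}
labels = list(itertools.product([0, 1], repeat=2))

# Risk minimizer under subset 0-1 loss, Eq. (2): the joint MAP assignment.
y_map = max(labels, key=lambda y: p[y])          # -> (1, 0)

# Thresholding each label's marginal at 0.5 (a BR-style prediction).
m1 = sum(p[y] for y in labels if y[0] == 1)      # p(y1 = 1 | x) = 0.65
m2 = sum(p[y] for y in labels if y[1] == 1)      # p(y2 = 1 | x) = 0.60
y_marginal = (int(m1 > 0.5), int(m2 > 0.5))      # -> (1, 1)

# Expected subset 0-1 loss of each prediction: 1 - p(prediction | x).
print(y_map, 1 - p[y_map])                       # approx. 0.60
print(y_marginal, 1 - p[y_marginal])             # approx. 0.75
```

The marginal-mode prediction (1, 1) incurs a higher expected subset 0-1 loss than the joint MAP (1, 0), which is why the framework focuses on modeling p(Y|X) as a whole.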

For convenience, we use the average log-likelihood l instead of L:

$$l = \frac{1}{N} \sum_{i=1}^{N} \log f(y^i \,|\, x^i_\theta, \tau). \quad (5)$$

By taking $p_B(Y|X_\theta)$ into consideration, (5) is rewritten as

$$l = \frac{1}{N} \sum_{i=1}^{N} \log \left[ \frac{f(y^i|x^i_\theta, \tau)}{p_B(y^i|x^i_\theta)} \cdot \frac{p_B(y^i|x^i_\theta)}{p_B(y^i|x^i)} \cdot \frac{p_B(y^i|x^i)}{p(y^i|x^i)} \cdot p(y^i|x^i) \right]. \quad (6)$$

By the law of large numbers, l approaches in probability the expected version

$$E_{xy} \left\{ \log \left[ \frac{f(Y|X_\theta, \tau)}{p_B(Y|X_\theta)} \cdot \frac{p_B(Y|X_\theta)}{p_B(Y|X)} \cdot \frac{p_B(Y|X)}{p(Y|X)} \cdot p(Y|X) \right] \right\}. \quad (7)$$

We negate the above formula so as to minimize

$$-l \approx E_{xy} \left\{ \log \frac{p_B(Y|X_\theta)}{f(Y|X_\theta, \tau)} \right\} + E_{xy} \left\{ \log \frac{p_B(Y|X)}{p_B(Y|X_\theta)} \right\} + E_{xy} \left\{ \log \frac{p(Y|X)}{p_B(Y|X)} \right\} - E_{xy} \left\{ \log p(Y|X) \right\}. \quad (8)$$

Since $X = \{X_\theta, X_{\bar\theta}\}$, the second term can be developed as follows:

$$E_{xy} \left\{ \log \frac{p_B(Y|X)}{p_B(Y|X_\theta)} \right\} = E_{xy} \left\{ \log \frac{p_B(Y|X_\theta, X_{\bar\theta})}{p_B(Y|X_\theta)} \right\} = E_{xy} \left\{ \log \frac{p_B(X_{\bar\theta}, Y|X_\theta)}{p_B(X_{\bar\theta}|X_\theta)\, p_B(Y|X_\theta)} \right\} = I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta), \quad (9)$$

where I(X; Y|Z) denotes the mutual information between X and Y conditioned on Z,

$$I(X; Y|Z) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z)\, p(y|z)} = E_{xyz} \left\{ \log \frac{p(X, Y|Z)}{p(X|Z)\, p(Y|Z)} \right\}. \quad (10)$$

Using the K-L divergence [28],

$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_x \left\{ \log \frac{p(x)}{q(x)} \right\}, \quad (11)$$

(8) is finally rewritten as

$$-l \approx D_{KL}(p_B \,\|\, f) + I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta) + D_{KL}(p \,\|\, p_B) + H(Y|X). \quad (12)$$

From (12), our objective function can be decomposed into four different terms.

1. $D_{KL}(p_B \| f)$: the K-L divergence between the conditional probability $p_B(Y|X_\theta)$ and the predictive model $f(Y|X_\theta, \tau)$, which measures how well f approximates p_B given the selected feature subset $X_\theta$. The parameter τ is optimized by the predefined predictive model, given the optimized parameters B and θ:

$$\tau^* = \arg\min_\tau D_{KL}(p_B(Y|X_\theta) \,\|\, f(Y|X_\theta, \tau)). \quad (13)$$

Thus (13) is the parameter selection problem. Distinct predictive models produce different τ*; it depends on our choice of the baseline model f.

2. $I_{p_B}(X_{\bar\theta}; Y|X_\theta)$: the mutual information between the unselected features $X_{\bar\theta}$ and the labels Y, conditioned on the selected features $X_\theta$. This term depends on both the approximate Bayesian network B and the selected features $X_\theta$. Given the optimized B, the optimal θ can be obtained as

$$\theta^* = \arg\min_\theta I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta). \quad (14)$$

In fact (14) is the multi-label feature selection problem.

3. $D_{KL}(p \| p_B)$: the K-L divergence between the underlying probability p(Y|X) and the approximate probability $p_B(Y|X)$ modeled by a Bayesian network, which measures how well p_B approximates p. This term depends only on the Bayesian network B, hence we have

$$B^* = \arg\min_B D_{KL}(p(Y|X) \,\|\, p_B(Y|X)). \quad (15)$$

Note that (15) is actually the model selection problem, which aims to find the optimal Bayesian network to capture label correlations.

4. H(Y|X): the conditional entropy of the labels Y given the features X. This term represents the uncertainty in the labels when all features are known, which is a bound on the Bayes error [29]. Since it is independent of all parameters, we can remove this term from our optimization problem.

Based on the above discussion, we take the following strategy.

Optimization Strategy. We address the problem of multi-label classification with feature selection in three stages: first learning the label space structure, then selecting a useful feature subset, and last building the predictive model.

[Figure 1: The framework of MLC via conditional likelihood maximization. The overall objective (4) decomposes via (12) into three sub-problems: model selection (19), feature selection (35) and parameter selection (13).]

Fig. 1 shows the framework of MLC via conditional likelihood maximization following the Optimization Strategy. In this strategy, instead of directly addressing the optimization problem (4), we solve the sub-problems (15), (14) and (13) independently. We discuss how the sub-problems (15) and (14) can be solved in Sections 4 and 5, respectively. In addition, we will see that some popular MLC methods and multi-label feature selection algorithms are embedded in this framework as special cases of the sub-problems under appropriate assumptions on the label or feature space.

4. Model selection

Under the Optimization Strategy, to model the underlying probability distribution p(Y|X), the optimal Bayesian network of a special type is obtained by optimizing

$$B^* = \arg\max_B L(B|D) = \arg\min_B D_{KL}(p(Y|X) \,\|\, p_B(Y|X)). \quad (16)$$

Here we limit B to the family of k-Dependence Bayesian (kDB) networks:

$$p_B(Y|X) = \prod_{j=1}^{q} p(Y_j \,|\, \mathrm{Pa}_j, X), \quad (17)$$

where Pa_j represents the parents of label Y_j, with $|\mathrm{Pa}_j| = \min\{j-1, k\}$ and $k \in [0, q-1]$.

Note that kDB limits the number of parents to at most k, in contrast with the chain rule in the canonical order of $Y_1, Y_2, \ldots, Y_q$:

$$p(Y|X) = \prod_{j=1}^{q} p(Y_j \,|\, Y_1, Y_2, \ldots, Y_{j-1}, X). \quad (18)$$

The optimization problem (16) has been addressed in our previous work [30] as a theorem.

Theorem 1. To approximate a conditional probability p(Y|X) within a certain family of Bayesian networks, the optimal Bayesian network in K-L divergence is obtained if the sum of the conditional mutual information between each variable of Y and its parent variables, given the observation X, is maximized.

Proof. The optimization problem (15) can be developed as follows:

$$B^* = \arg\min_B E_{xy} \left\{ \log \frac{p(Y|X)}{p_B(Y|X)} \right\} = \arg\max_B E_{xy} \{\log p_B(Y|X)\} + H(Y|X).$$

H(Y|X) can be omitted due to its independence of B, thus we have

$$B^* = \arg\max_B E_{xy} \{\log p_B(Y|X)\} = \arg\max_B \sum_{j=1}^{q} E_{xy} \{\log p(Y_j \,|\, \mathrm{Pa}_j, X)\} = \arg\max_B \sum_{j=1}^{q} E_{xy} \left\{ \log \frac{p(Y_j, \mathrm{Pa}_j|X)}{p(Y_j|X)\, p(\mathrm{Pa}_j|X)} \, p(Y_j|X) \right\} = \arg\max_B \sum_{j=1}^{q} I(Y_j; \mathrm{Pa}_j|X) - \sum_{j=1}^{q} H(Y_j|X).$$

Since $\sum_{j=1}^{q} H(Y_j|X)$ is independent of B, we reach our conclusion:

$$B^* = \arg\min_B D_{KL}(p(Y|X) \,\|\, p_B(Y|X)) = \arg\max_B \sum_{j=1}^{q} I(Y_j; \mathrm{Pa}_j|X). \quad (19)$$

The objective of (16) is to encode label correlations in a Bayesian network. In [8], a formal definition of two types of label correlations, namely conditional and marginal dependence, is given.

Conditional dependence captures the dependence of labels given a specific instance, while marginal dependence can be considered as the expected dependence averaged over all instances [8]. According to this definition, the optimal Bayesian network in (19) obtained by Theorem 1 actually models conditional label correlations. For more information on label dependence in MLC, see [8]. From Theorem 1, the following corollary is derived.

Corollary 1. Given a specific order of labels, the probability p_B modeled by a k-dependence Bayesian network optimally approximates p(Y|X) in terms of K-L divergence if the parents of each label Y_j satisfy $|\mathrm{Pa}_j| = j-1$, $j = 1, \ldots, q$.

Proof. Assume that the labels are ordered as $Y_1, Y_2, \ldots, Y_q$, so that the possible parents of label Y_j are restricted to its previous labels $S_j = \{Y_1, Y_2, \ldots, Y_{j-1}\}$, $j = 1, \ldots, q$. Hence, the optimization problem (19) can be solved independently for each label, and is equivalent to optimally selecting the parent labels from the previous labels:

$$B^* = \arg\max_B I(Y_j; \mathrm{Pa}_j|X), \quad j = 1, \ldots, q. \quad (20)$$

Let $\overline{\mathrm{Pa}}_j$ be $S_j \setminus \mathrm{Pa}_j$. Based on the chain rule of mutual information, we have

$$I(Y_j; S_j|X) = I(Y_j; \mathrm{Pa}_j|X) + I(Y_j; \overline{\mathrm{Pa}}_j \,|\, \mathrm{Pa}_j, X). \quad (21)$$

Since conditional mutual information is always non-negative, i.e., $I(Y_j; \overline{\mathrm{Pa}}_j|\mathrm{Pa}_j, X) \geq 0$, we have $I(Y_j; S_j|X) \geq I(Y_j; \mathrm{Pa}_j|X)$. Hence, (20) is optimized by treating all previous labels as the parents of Y_j, i.e., $\mathrm{Pa}_j = S_j$. In this way, a fully-connected Bayesian network is chosen for B*, thus $|\mathrm{Pa}_j| = j-1$, $j = 1, \ldots, q$.

4.1. Review of CC-based methods

For the Classifier Chains (CC) and Probabilistic Classifier Chains (PCC) methods, given a predefined chain order, the classifier for each label is trained by taking the previous labels as extra features. In this sense, a fully-connected Bayesian network is constructed by CC and PCC according to a particular chain order. The difference between CC and PCC is that, in the testing phase, CC finds its prediction by the greedy search

$$\hat{y}_j = \arg\max_{y_j} p(y_j \,|\, \hat{\mathrm{pa}}_j, \hat{x}), \quad j = 1, \ldots, q. \quad (22)$$

In contrast, PCC aims to find the MAP assignment by searching the $2^q$ paths according to (2), resulting in an exponential complexity in q in the testing phase [6]. Thus PCC is actually a risk minimizer in terms of subset 0-1 loss [31], while CC can be considered as a deterministic approximation to PCC.
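To make the chain training and the greedy prediction in (22) concrete, here is a minimal Python sketch. It uses scikit-learn's LogisticRegression as the base binary learner purely for illustration (the experiments in Section 8 use a linear SVM instead), and the data and chain order below are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cc(X, Y, order):
    """Train one binary classifier per label, each using the true values of the
    preceding labels (in the given chain order) as extra input features."""
    models = []
    for pos, j in enumerate(order):
        parents = list(order[:pos])                       # all preceding labels
        Xj = np.hstack([X, Y[:, parents]]) if parents else X
        models.append(LogisticRegression(max_iter=1000).fit(Xj, Y[:, j]))
    return models

def predict_cc(models, X, order, q):
    """Greedy prediction, Eq. (22): each label is predicted from the features
    plus the previously *predicted* labels along the chain."""
    Y_hat = np.zeros((X.shape[0], q), dtype=int)
    for pos, j in enumerate(order):
        parents = list(order[:pos])
        Xj = np.hstack([X, Y_hat[:, parents]]) if parents else X
        Y_hat[:, j] = models[pos].predict(Xj)
    return Y_hat

# Usage with random placeholder data (100 instances, 10 features, 4 labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = (rng.random(size=(100, 4)) < 0.3).astype(int)
order = [0, 1, 2, 3]
models = train_cc(X, Y, order)
print(predict_cc(models, X, order, q=4).shape)
```

Replacing the greedy pass by a search over all $2^q$ label vectors for the highest estimated joint probability would give PCC-style MAP prediction instead of CC.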

[Figure 2: Graphical models of CC-based methods in the order Y1 → Y2 → Y3 → Y4: (a) BR, (b) CC and PCC, (c) BCC, (d) kCC (k = 2).]

The predictions of PCC ($\hat{y}^{pcc}$) and CC ($\hat{y}^{cc}$) are equal if the conditional probability of the MAP assignment ($\hat{y}^{map}$) is more than 0.5 [6], i.e., $\hat{y}^{cc} = \hat{y}^{pcc}$ if $p(\hat{y}^{map}|\hat{x}) > 0.5$. As stated in Sec. 3.1, this study focuses on modeling p(Y|X) rather than finding the MAP assignment. The following corollary is introduced for four CC-based methods, CC, kCC, BCC and BR, which conduct the simple greedy search in the testing phase. Based on Corollary 1, the following corollary on these CC-based methods is derived.

Corollary 2. Given the same chain order of q labels, in the ideal case, classifier chains asymptotically outperforms k-dependence classifier chains in terms of subset 0-1 loss when k < q-1. Similarly, k1-dependence classifier chains asymptotically outperforms k2-dependence classifier chains in subset 0-1 loss for 0 ≤ k2 < k1 ≤ q-1.

Proof. Based on Corollary 1, given a chain order, classifier chains optimally models p(Y|X) by a fully-connected Bayesian network. Therefore, according to (2), CC is in fact the best approximation of the risk minimizer in subset 0-1 loss among the four CC-based methods. Moreover, for k-dependence classifier chains, the K-L divergence between p_B(Y|X) and p(Y|X) always decreases as the value of k increases. According to (2) again, we reach our conclusion.

Here we note that CC, BCC and BR can all be viewed as special cases of kCC by setting the value of k to q-1, 1 and 0, respectively.

CC approximates the underlying distribution p(Y|X) better than Bayesian Classifier Chains (BCC) and Binary Relevance (BR); therefore, by (2), CC outperforms BCC and BR in subset 0-1 loss in the asymptotic case. According to Corollary 2, given the same chain order, as an asymptotic analysis we have the following order in performance:

$$\mathrm{PCC} \succeq \mathrm{CC} \succeq k\mathrm{CC} \succeq \mathrm{BCC} \succeq \mathrm{BR}, \quad 1 < k < q-1. \quad (23)$$

Here $A \succeq B$ means that A asymptotically outperforms B in subset 0-1 loss. It is worth noting that PCC is included in (23) since it is the exact optimizer of subset 0-1 loss. Fig. 2 shows the probabilistic graphical models of the CC-based methods over four labels in the chain order $Y_1 \to Y_2 \to Y_3 \to Y_4$.

4.2. k-dependence Bayesian network for CC

As discussed above, using all preceding labels, $\mathrm{Pa}_j = S_j$, is optimal for approximating p(Y|X) in a given chain order. However, this is not always true in real-world problems with a finite number of instances. It is often possible to build a better model by eliminating irrelevant and redundant parent labels. Successful elimination of such useless parents simplifies the learning model and gives better performance. In this study, therefore, we propose to use the k-dependence Bayesian network [32], where at most k preceding labels are adopted as the parents of a label. As shown in the last section, given a chain order, the optimization problem (19) can be solved independently for each label. Hence, it suffices to solve the problem

$$B^* = \arg\max_B L(B|D) = \arg\max_B I(Y_j; \mathrm{Pa}_j|X), \quad j = 1, \ldots, q, \quad (24)$$

where $\mathrm{Pa}_j \subseteq S_j$ and $|\mathrm{Pa}_j| \leq k$. However, the calculation of conditional mutual information is very costly for high-dimensional feature vectors. Thus, we find an approximation for $I(Y_j; \mathrm{Pa}_j|X)$, relying on some appropriate assumptions.

Theorem 2. $\arg\max_B I(Y_j; \mathrm{Pa}_j) = \arg\max_B I(Y_j; \mathrm{Pa}_j|X)$, if the label Y_j is independent of the feature vector X conditioned on the parent labels Pa_j.

Proof. According to the fact that $I(A;B|C) - I(A;B) = I(A;C|B) - I(A;C)$, we have

$$I(Y_j; \mathrm{Pa}_j|X) = I(Y_j; \mathrm{Pa}_j) + I(Y_j; X|\mathrm{Pa}_j) - I(Y_j; X). \quad (25)$$

If the conditional independence assumption holds, the second term of (25) vanishes. In addition, the third term is independent of the Bayesian network. Thus we have the consistency.

To simplify the computation, a forward parent label selection algorithm is developed here for the optimization problem $B^* = \arg\max_B I(Y_j; \mathrm{Pa}_j)$. Since

$$I(Y_j; \mathrm{Pa}_j \cup Y_l) = I(Y_j; \mathrm{Pa}_j) + I(Y_j; Y_l \,|\, \mathrm{Pa}_j), \quad (26)$$

starting from $\mathrm{Pa}_j = \emptyset$, we can update Pa_j by adding the $Y_l \notin \mathrm{Pa}_j$ such that

$$Y_l^* = \arg\max_{Y_l \notin \mathrm{Pa}_j} I(Y_l; Y_j \,|\, \mathrm{Pa}_j), \quad (27)$$

until $|\mathrm{Pa}_j|$ reaches a predefined number. $I(Y_l; Y_j|\mathrm{Pa}_j)$ is further developed as

$$I(Y_l; Y_j|\mathrm{Pa}_j) = I(Y_l; Y_j) + I(Y_j; \mathrm{Pa}_j|Y_l) - I(Y_j; \mathrm{Pa}_j) = I(Y_l; Y_j) + H(\mathrm{Pa}_j|Y_l) - H(\mathrm{Pa}_j|Y_j, Y_l) - I(Y_j; \mathrm{Pa}_j). \quad (28)$$

Parent Independence Assumption. For the target label Y_j, a preceding label $Y_k \in \mathrm{Pa}_j$ and $Y_l \notin \mathrm{Pa}_j$, we assume:

$$p(\mathrm{Pa}_j|Y_l) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_l), \quad (29)$$

$$p(\mathrm{Pa}_j|Y_j, Y_l) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_j, Y_l). \quad (30)$$

These assumptions require that the parent labels are conditionally independent given an unselected parent Y_l (29), or given Y_l together with the target label Y_j (30). Under this assumption, we have the following:

$$I(Y_l; Y_j|\mathrm{Pa}_j) = I(Y_l; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} H(Y_k|Y_l) - \sum_{Y_k \in \mathrm{Pa}_j} H(Y_k|Y_j, Y_l) - I(Y_j; \mathrm{Pa}_j) = I(Y_l; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} I(Y_k; Y_j|Y_l) - I(Y_j; \mathrm{Pa}_j). \quad (31)$$

Substituting (31) into (27) and removing the last term, which is independent of Y_l, we have

$$Y_l^* = \arg\max_{Y_l \notin \mathrm{Pa}_j} I(Y_l; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|Y_l). \quad (32)$$

In (32), the first term captures the relevance between the unselected parent label Y_l and the target label Y_j, while the second term models the mutual information between each selected label Y_k and the target label Y_j, conditioned on the unselected label Y_l. Since $I(Y_j; Y_k|Y_l) = I(Y_j; Y_k) + I(Y_l; Y_k|Y_j) - I(Y_l; Y_k)$, the second term actually captures both the conditional redundancy $I(Y_l; Y_k|Y_j)$ and the redundancy $I(Y_l; Y_k)$. This means that the parent label with the best trade-off between relevancy and redundancy is selected. It is worth noting that in [27] the authors reach a similar conclusion for information theoretic feature selection; the difference is that here our objective is to perform parent label selection to find the optimal k-dependence Bayesian network given a chain order.
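A small sketch of the forward parent-label selection loop of (27), scored with the criterion (32), may make the procedure clearer. It uses plug-in mutual information estimates from scikit-learn (mutual_info_score) on binary label columns; the helper names and the random data are ours, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    """Plug-in estimate of I(A; B | C) for discrete arrays, computed as the
    C-weighted average of per-stratum mutual information."""
    total = 0.0
    for val in np.unique(c):
        mask = (c == val)
        total += mask.mean() * mutual_info_score(a[mask], b[mask])
    return total

def select_parents(Y, j, candidates, k):
    """Greedy forward parent-label selection for label j, Eqs. (27) and (32):
    score each candidate Y_l by I(Y_l; Y_j) + sum_{Y_p in Pa_j} I(Y_j; Y_p | Y_l)."""
    pa, cand = [], list(candidates)
    while cand and len(pa) < k:
        def score(l):
            return (mutual_info_score(Y[:, l], Y[:, j]) +
                    sum(cond_mi(Y[:, j], Y[:, p], Y[:, l]) for p in pa))
        best = max(cand, key=score)
        pa.append(best)
        cand.remove(best)
    return pa

# Example: pick at most k = 2 parents for label 3 among its preceding labels.
rng = np.random.default_rng(1)
Y = (rng.random((200, 4)) < 0.4).astype(int)
print(select_parents(Y, j=3, candidates=[0, 1, 2], k=2))
```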

We can generalize (32) as

$$Y_l^* = \arg\max_{Y_l \notin \mathrm{Pa}_j} I(Y_l; Y_j) + \alpha \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|Y_l), \quad j = 1, \ldots, q, \quad (33)$$

where $\alpha \geq 0$ is a parameter controlling the weights assigned to the two terms. In practice, we prefer to set $\alpha = 1/|\mathrm{Pa}_j|$ to prevent the second term from sweeping away the first as the number of parent labels grows. Based on (33), we propose the k-dependence Classifier Chains (kCC) method.

5. Multi-label feature selection

Based on the Optimization Strategy, the following problem can be formulated on top of the built Bayesian network B:

$$\arg\max_\theta L(\theta|D) = \arg\min_\theta I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta). \quad (34)$$

Note that there is no monotonicity in the conditional mutual information, so θ = (0, 0, ..., 0) or θ = (1, 1, ..., 1) is not always the solution. According to the chain rule of mutual information and $X = \{X_\theta, X_{\bar\theta}\}$, we have

$$I_{p_B}(X; Y) = I_{p_B}(X_\theta; Y) + I_{p_B}(X_{\bar\theta}; Y \,|\, X_\theta). \quad (35)$$

From the above formula, we see that minimization of $I_{p_B}(X_{\bar\theta}; Y|X_\theta)$ is equivalent to maximization of $I_{p_B}(X_\theta; Y)$; thus (34) is transformed into

$$\arg\max_\theta L(\theta|D) = \arg\max_\theta I_{p_B}(X_\theta; Y). \quad (36)$$

Note that a similar optimization problem has been proposed, by intuition, in the field of information theoretic feature selection for a single label variable Y in a variety of papers [27, 33, 34]. Here we theoretically derive the optimization problem from conditional likelihood maximization for the multi-label case, and take label correlations into account through the learned Bayesian network. However, directly solving (36) is non-trivial due to the difficulty of calculating mutual information between the high-dimensional $X_\theta$ and Y. Hence, similarly to Section 4.2, greedy sequential optimization is applied to (36). Since

$$I_{p_B}(X_\theta \cup X_m; Y) = I_{p_B}(X_\theta; Y) + I_{p_B}(X_m; Y \,|\, X_\theta), \quad (37)$$

starting from $X_\theta = \emptyset$, we can update $X_\theta$ by adding the $X_m \in X_{\bar\theta}$ such that

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta}} I_{p_B}(X_m; Y \,|\, X_\theta), \quad (38)$$

until $|X_\theta|$ reaches a predefined number. By taking the Bayesian network B into consideration, we have

$$I_{p_B}(X_m; Y|X_\theta) = \sum_{j=1}^{q} I(X_m; Y_j \,|\, \mathrm{Pa}_j, X_\theta) = \sum_{j=1}^{q} \left( I(X_m; Y_j) + I(\mathrm{Pa}_j, X_\theta; Y_j|X_m) - I(\mathrm{Pa}_j, X_\theta; Y_j) \right). \quad (39)$$

Parent and Feature Independence Assumption. For each feature $X_n \in X_\theta$, given an unselected feature $X_m \in X_{\bar\theta}$ and the target labels Y, we assume the following:

$$p(\mathrm{Pa}_j, X_\theta|Y_j) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_j) \prod_{X_n \in X_\theta} p(X_n|Y_j),$$

$$p(\mathrm{Pa}_j, X_\theta|Y_j, X_m) = \prod_{Y_k \in \mathrm{Pa}_j} p(Y_k|Y_j, X_m) \prod_{X_n \in X_\theta} p(X_n|Y_j, X_m).$$

This requires that the parents Pa_j and the selected features $X_\theta$ are conditionally independent given Y_j, with or without the unselected feature X_m. Based on the above assumption and (39), (38) becomes

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta}} \sum_{j=1}^{q} \left( I(X_m; Y_j) + \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|X_m) + \sum_{X_n \in X_\theta} I(Y_j; X_n|X_m) \right). \quad (40)$$

The first term captures the Relevance between the unselected feature X_m and the target label Y_j, and the second term captures the Label Correlations modeled in the k-dependence Bayesian network. For the third term of (40), we have

$$I(Y_j; X_n|X_m) = I(Y_j; X_n) + I(X_m; X_n|Y_j) - I(X_m; X_n). \quad (41)$$

Since $I(Y_j; X_n)$ is independent of X_m, $I(Y_j; X_n|X_m)$ is effectively represented by the last two terms. We therefore call $I(Y_j; X_n|X_m)$ the Redundancy Difference, since it reflects the difference between the Conditional Redundancy $I(X_m; X_n|Y_j)$ and the Redundancy $I(X_m; X_n)$. Hence, according to (40), a feature is selected because it provides the best trade-off among relevance, label correlations and redundancy. We can further generalize (40) as

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta}} \sum_{j=1}^{q} \left( I(X_m; Y_j) + \beta \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|X_m) + \gamma \sum_{X_n \in X_\theta} I(Y_j; X_n|X_m) \right), \quad (42)$$

where $\beta, \gamma \geq 0$ are two parameters controlling the weights on the label correlation and redundancy difference terms, respectively. In practice, we typically set $\beta = 1/|\mathrm{Pa}_j|$ and $\gamma = 1/|X_\theta|$ to normalize the terms, preventing either term from being swept away as the number of parents or selected features grows. This means we hold relatively weak assumptions on conditional label independence (the second term) and feature independence (the third term).

In the derivation of (42), we assume the relevant features are identical for all the labels. However, it is more general and reasonable to assume that the selected feature subset is label-specific. For the target label j, its relevant features are denoted by $X_{\theta_j}$, thus $X = \{X_{\theta_j}, X_{\bar\theta_j}\}$, $j = 1, \ldots, q$. Based on a similar derivation and assumption to (42), we obtain the following feature selection criterion for label j, $j = 1, \ldots, q$:

$$X_m^* = \arg\max_{X_m \in X_{\bar\theta_j}} I(X_m; Y_j) + \beta \sum_{Y_k \in \mathrm{Pa}_j} I(Y_j; Y_k|X_m) + \gamma \sum_{X_n \in X_{\theta_j}} I(Y_j; X_n|X_m). \quad (43)$$

In this case, we set $\beta = 1/|\mathrm{Pa}_j|$ and $\gamma = 1/|X_{\theta_j}|$. Note that the criterion (43) is optimized independently for each label, since $X_m \in X_{\bar\theta_j}$ and $X_{\theta_j}$ is label-specific. In this paper, we employ (42) and (43) for global and local Multi-Label Feature Selection (MLFS), respectively.
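As an illustration of the label-specific criterion (43), the following Python sketch greedily scores candidate features by the relevance, label-correlation and redundancy-difference terms, with β = 1/|Pa_j| and γ = 1/|X_θj| as stated above. It assumes the features have already been discretized (see Section 7.1); all helper names and data are placeholders rather than the authors' code.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    """I(A; B | C) for discrete arrays, as the C-weighted average of per-stratum MI."""
    return sum((c == v).mean() * mutual_info_score(a[c == v], b[c == v])
               for v in np.unique(c))

def mlfs_local(Xd, Y, j, parents, n_feat):
    """Greedy label-specific feature selection for label j following (43):
    relevance I(X_m; Y_j), the label-correlation term over Pa_j, and the
    redundancy-difference term over already selected features.  Xd holds
    discretized features; parents is the (assumed) parent set from the kDB network."""
    selected, remaining = [], list(range(Xd.shape[1]))
    while remaining and len(selected) < n_feat:
        beta = 1.0 / max(len(parents), 1)     # beta = 1 / |Pa_j|
        gamma = 1.0 / max(len(selected), 1)   # gamma = 1 / |X_theta_j|
        def score(m):
            rel = mutual_info_score(Xd[:, m], Y[:, j])
            corr = sum(cond_mi(Y[:, j], Y[:, p], Xd[:, m]) for p in parents)
            red = sum(cond_mi(Y[:, j], Xd[:, n], Xd[:, m]) for n in selected)
            return rel + beta * corr + gamma * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: 8 discretized features, 3 labels; select 3 features for label 2,
# assuming its parents in the learned network are labels 0 and 1.
rng = np.random.default_rng(2)
Xd = rng.integers(-1, 2, size=(200, 8))       # features discretized to {-1, 0, 1}
Y = (rng.random((200, 3)) < 0.4).astype(int)
print(mlfs_local(Xd, Y, j=2, parents=[0, 1], n_feat=3))
```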

6. Summary of theoretical findings

In Table 1, we summarize the techniques above by comparing model selection and feature selection algorithms. By combining kCC (33) and local MLFS (43), we propose the optimized Classifier Chains (oCC) method, whose procedure is outlined in Algorithm 1.

Table 1: The generality of the proposed framework for MLC.

Model selection:
  $|\mathrm{Pa}_j| = 0$       Binary Relevance (BR)      Boutell et al. (2004) [2]
  $|\mathrm{Pa}_j| \leq 1$    Bayesian CC (BCC)          Zaragoza et al. (2011) [7]
  $|\mathrm{Pa}_j| = j-1$     Classifier Chains (CC)     Read et al. (2011) [5]
  $|\mathrm{Pa}_j| \leq k$    kCC (Proposed)             Sun and Kudo (2017)

Feature selection, q = 1:
  $\beta = \gamma = 0$                  MI Maximization (MIM)   Lewis (1992) [35]
  $\beta = 0$, $\gamma = 1/|X_\theta|$  Joint MI (JMI)          Yang and Moody (1999) [34]
  $\beta = 0$, $\gamma = 1/|X_\theta|$  Max-Rel Min-Red         Peng et al. (2005) [33]

Feature selection, q > 1:
  $\beta = \gamma = 0$                                   Multi-Label MIM   Sechidis et al. (2012) [15]
  $\beta = 0$, $\gamma = 1/|X_\theta|$                   Multi-Label JMI   Sechidis et al. (2012) [15]
  $\beta = 1/|\mathrm{Pa}_j|$, $\gamma = 1/|X_\theta|$   MLFS (Proposed)   Sun and Kudo (2017)

(Max-Rel Min-Red ignores the conditional redundancy term in (42).)

7. Implementation issues

7.1. Mutual information estimation

In the optimization of both model and feature selection, the calculation of mutual information is used extensively. For discrete/categorical feature variables X (Y is originally binary), the calculation of mutual information is simple and straightforward. Given a sample of N i.i.d. observations $\{x^i, y^i\}_{i=1}^{N}$, based on the law of large numbers we have the following approximation:

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{\hat{p}(x^i, y^i)}{\hat{p}(x^i)\, \hat{p}(y^i)}, \quad (44)$$

where $\hat{p}$ denotes the empirical probability distribution. In contrast, when the feature variable X is continuous, it becomes quite difficult to compute the mutual information I(X; Y), since it is typically impossible to obtain $\hat{p}$. One solution is to use kernel density estimation, in which case $\hat{p}$ is approximated by

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x^i), \quad (45)$$

where $K_h(\cdot)$ is a non-negative kernel function with bandwidth h (h > 0).

Typically, the Gaussian kernel can be used as $K_h(\cdot)$. However, it is computationally expensive to compute (45), and it is usually difficult to select a good value of the bandwidth h. Another solution for computing I(X; Y) with continuous X is to apply data discretization as preprocessing [27]. In this study, we adopt this method and discretize a continuous variable X based on its mean $\mu_X$ and standard deviation $\sigma_X$. For example, we divide a continuous value of the feature variable X into one of three categories {-1, 0, 1} according to $\mu_X \pm \sigma_X$. The experimental results demonstrate the efficiency of this approach for approximating I(X; Y). Note that data discretization is performed only for calculating mutual information in feature selection; the original continuous/discrete features are used to train the classification models after the selection of useful features.

Algorithm 1: The algorithm of oCC (kCC + MLFS)
Input: $D = \{x^i, y^i\}_{i=1}^{N}$: training set with q labels; chain: label order; $\hat{x}$: test instance; k: maximal size of parents; M: size of selected features; $f_1, \ldots, f_q$: q predictive models.
Output: $\hat{y}$: predicted label set of $\hat{x}$.
Training:
1: $S \leftarrow \emptyset$, $\mathrm{Pa}_j \leftarrow \emptyset$ for $j = 1, \ldots, q$;
2: for $j \in$ chain do
3:   for $l = 1, \ldots, k$ do
4:     select $Y_l$ from S according to (33);
5:     $\mathrm{Pa}_j \leftarrow \mathrm{Pa}_j \cup Y_l$;
6:   for $m = 1, \ldots, M$ do
7:     select $X_m$ from X according to (43);
8:     $X_{\theta_j} \leftarrow X_{\theta_j} \cup X_m$;
9:   build classifier $f_j$ on $D_j$, where $D_j = \{x^i_{\theta_j} \cup \mathrm{pa}^i_j,\; y^i_j\}_{i=1}^{N}$;
10:  $S \leftarrow S \cup Y_j$;
Testing:
11: for $j \in$ chain do
12:   $\hat{y}_j \leftarrow f_j(\hat{x}_{\theta_j} \cup \hat{\mathrm{pa}}_j)$;
13: $\hat{y} \leftarrow (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_q)$.
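The plug-in estimate (44) combined with the mean-and-standard-deviation discretization of Section 7.1 can be sketched in a few lines of Python; the function names below are ours and the data is synthetic.

```python
import numpy as np

def discretize(x):
    """Map a continuous feature to {-1, 0, 1} using mu +/- sigma, as in Sec. 7.1."""
    mu, sigma = x.mean(), x.std()
    return np.where(x < mu - sigma, -1, np.where(x > mu + sigma, 1, 0))

def mutual_information(a, b):
    """Plug-in estimate of I(A; B), Eq. (44), from the empirical joint and
    marginal frequencies of two discrete 1-D arrays (in nats)."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                p_a, p_b = np.mean(a == va), np.mean(b == vb)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Example: a continuous feature whose mean shifts with a binary label.
rng = np.random.default_rng(3)
y = (rng.random(500) < 0.5).astype(int)
x = rng.normal(loc=y.astype(float), scale=1.0)
print(mutual_information(discretize(x), y))   # positive value, in nats
```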

7.2. Chain order selection

The chain order is crucial to the performance of CC-based methods [5, 6]. To improve the performance, [18] and [19] propose to select the chain order by using Beam Search and a Monte Carlo algorithm, respectively. In this study, we develop a greedy algorithm for efficient chain order selection according to Theorem 1. Different from Theorem 1, where model selection is conducted to find the optimal k-dependence Bayesian network ($|\mathrm{Pa}_j| \leq k$ for all j) based on the canonical order, here we aim to perform model selection by selecting the optimal chain order π for a fully-connected Bayesian network $B_\pi$ ($|\mathrm{Pa}_j| = j-1$ for all j). Suppose that a permutation $\pi: \{1, 2, \ldots, q\} \to \{1, 2, \ldots, q\}$ determines the optimal chain order of labels, so that the jth label is placed at the π(j)th position of the chain. Thus, following Theorem 1, we have

$$\pi^* = \arg\min_\pi D_{KL}(p(Y|X) \,\|\, p_{B_\pi}(Y|X)) = \arg\max_\pi \sum_{j=1}^{q} I(Y_{\pi(j)}; \mathrm{Pa}_j|X), \quad (46)$$

where $\mathrm{Pa}_j = \{Y_{\pi(k)}\}_{k=1}^{j-1}$. Similar to the derivation in Section 4.2, a greedy forward search algorithm can be developed to solve the optimization problem (46) with an appropriate independence assumption. In the jth iteration, we select the best place π(j) for the jth label according to the following formula:

$$\pi^*(j) = \arg\max_{\pi(j)} \sum_{k=1}^{j-1} \left( \frac{1}{d} \sum_{m=1}^{d} I(Y_{\pi(j)}; Y_{\pi(k)}|X_m) + \frac{1}{k-1} \sum_{l=1}^{k-1} I(Y_{\pi(k)}; Y_{\pi(l)}|Y_{\pi(j)}) \right). \quad (47)$$

Algorithm 2 outlines the procedure of chain order selection. In Sec. 8.5, we show the efficiency of Algorithm 2.

Algorithm 2: The algorithm of chain order selection
Input: $D = \{x^i, y^i\}_{i=1}^{N}$: training data with q labels.
Output: π: selected chain order of the q labels.
1: $S_\pi \leftarrow \emptyset$, $\bar{S}_\pi \leftarrow \{1, 2, \ldots, q\}$;
2: initialize $\pi(1) \leftarrow \arg\max_{\pi(1)} \sum_{j=2}^{q} \sum_{m=1}^{d} I(Y_{\pi(1)}; Y_{\pi(j)}|X_m)$;
3: $S_\pi \leftarrow \pi(1)$, $\bar{S}_\pi \leftarrow \bar{S}_\pi \setminus \pi(1)$;
4: for $j = 2, \ldots, q$ do
5:   select $\pi(j) \in \bar{S}_\pi$ according to (47);
6:   $S_\pi \leftarrow S_\pi \cup \pi(j)$, $\bar{S}_\pi \leftarrow \bar{S}_\pi \setminus \pi(j)$.
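The following is a simplified Python sketch of Algorithm 2 in the spirit of the experiments in Sec. 8.5: it keeps only the first term of (47) and conditions on a single discretized feature (standing in for a discretized first principal component) instead of looping over all d features. The helpers and data are placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cond_mi(a, b, c):
    """I(A; B | C) for discrete arrays via stratified plug-in estimates."""
    return sum((c == v).mean() * mutual_info_score(a[c == v], b[c == v])
               for v in np.unique(c))

def select_chain_order(Y, z):
    """Greedy chain order selection in the spirit of Algorithm 2, keeping only
    the first term of (47) and conditioning on a single discretized feature z."""
    q = Y.shape[1]
    remaining = list(range(q))
    # Head of the chain: the label most informative about all the others.
    head = max(remaining,
               key=lambda j: sum(cond_mi(Y[:, j], Y[:, l], z)
                                 for l in remaining if l != j))
    order = [head]
    remaining.remove(head)
    # Greedily append the label most related to the labels already placed.
    while remaining:
        nxt = max(remaining,
                  key=lambda j: sum(cond_mi(Y[:, j], Y[:, k], z) for k in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Example with placeholder data: 5 labels, z discretized to three bins.
rng = np.random.default_rng(4)
Y = (rng.random((300, 5)) < 0.4).astype(int)
z = rng.integers(-1, 2, size=300)
print(select_chain_order(Y, z))
```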

8. Experiments

8.1. Datasets and evaluation metrics

We conducted experiments on thirteen benchmark datasets from Mulan [36] and Meka [37] in order to evaluate the performance of the proposed method. The statistics of the datasets are summarized in Table 2, where Cardinality and Density denote the label cardinality and label density, respectively. In addition, Distinct denotes the number of distinct label combinations, and Feature type gives the property of the features.

[Table 2: The statistics of the experimental multi-label datasets (the number of instances N, features d and labels q, label cardinality, label density and the number of distinct label combinations). The thirteen datasets, with feature type and domain, are: emotions (numeric, music), scene (numeric, image), yeast (numeric, biology), birds (numeric, audio), genbase (nominal, biology), medical (nominal, text), enron (nominal, text), languagelog (nominal, text), rcv1s (numeric, text), bibtex (nominal, text), corel16k (nominal, image), corel5k (nominal, image), delicious (nominal, text).]

Given a test dataset $T = \{x^i, y^i\}_{i=1}^{N_T}$, we use four evaluation metrics to report the experimental results. Here $1_{[\cdot]}$ denotes the indicator function.

$$\text{Exact-Match} := \frac{1}{N_T} \sum_{i=1}^{N_T} 1_{[\hat{y}^i = y^i]}, \qquad \text{Hamming-Score} := \frac{1}{N_T} \sum_{i=1}^{N_T} \frac{1}{q} \sum_{l=1}^{q} 1_{[\hat{y}^i_l = y^i_l]},$$

$$\text{Macro-F1} := \frac{1}{q} \sum_{l=1}^{q} \frac{2 \sum_{i=1}^{N_T} \hat{y}^i_l\, y^i_l}{\sum_{i=1}^{N_T} \hat{y}^i_l + \sum_{i=1}^{N_T} y^i_l}, \qquad \text{Micro-F1} := \frac{2 \sum_{l=1}^{q} \sum_{i=1}^{N_T} \hat{y}^i_l\, y^i_l}{\sum_{l=1}^{q} \sum_{i=1}^{N_T} \hat{y}^i_l + \sum_{l=1}^{q} \sum_{i=1}^{N_T} y^i_l}.$$
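For reference, the four metrics defined above can be computed directly from 0/1 prediction and ground-truth matrices; the small Python sketch below follows the definitions, with a small epsilon added only to guard against labels that never occur.

```python
import numpy as np

def evaluate(Y_true, Y_pred):
    """Compute Exact-Match, Hamming-Score, Macro-F1 and Micro-F1 as defined
    above for 0/1 matrices of shape (number of test instances, q)."""
    eps = 1e-12                                    # guard against empty labels
    exact = np.mean(np.all(Y_pred == Y_true, axis=1))
    hamming = np.mean(Y_pred == Y_true)
    tp = np.sum(Y_pred * Y_true, axis=0)           # per-label true positives
    denom = np.sum(Y_pred, axis=0) + np.sum(Y_true, axis=0)
    macro_f1 = np.mean(2 * tp / (denom + eps))
    micro_f1 = 2 * tp.sum() / (denom.sum() + eps)
    return {"Exact-Match": exact, "Hamming-Score": hamming,
            "Macro-F1": macro_f1, "Micro-F1": micro_f1}

# Example on small placeholder predictions.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
print(evaluate(Y_true, Y_pred))
```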

The above metrics can be cast into two categories: instance-based metrics (Exact-Match and Hamming-Score) and label-based metrics (Macro-F1 and Micro-F1) [38]. Among the metrics, Exact-Match is the most stringent measure, since it does not reward partial matches of label sets; in spite of that, it is a good metric to measure how well label correlations are modeled. Hamming-Score emphasizes the prediction accuracy on label-instance pairs and is able to evaluate the performance on each single label. Macro-F1 and Micro-F1 take partial matches of label sets into account, providing a good supplement to the instance-based metrics. In addition, Macro-F1 is more sensitive to the performance on rare categories (the labels in the minority), while Micro-F1 is affected more by the major categories (the labels in the majority). Hence, the joint use of Macro-F1 and Micro-F1 is a good supplement to the instance-based evaluation metrics for evaluating the performance of MLC methods.

8.2. Configuration

In the experiments we compare the following seven MLC methods:

BR: Binary Relevance (BR) [2]. It transforms a multi-label problem with q labels into q binary classification problems according to the one-vs-all strategy.

ECC: Ensemble of CCs (ECC) [5]. A number of CCs are built by randomly selecting the chain orders, and the final prediction is made by majority votes.

EBCC: Ensemble of BCCs (EBCC) [7]. A number of BCCs are built by randomly selecting their roots, and the prediction is made in the same way as in ECC.

MDDM: Multi-label Dimensionality reduction via Dependence Maximization (MDDM) [12]. It is a global FS-DR method. By maximizing the feature-label dependence, a projection is built to map the features into a low-dimensional subspace.

LLSF: Learning Label-Specific Features (LLSF) [16]. It is a local FS-DR method. Label-specific features are selected by optimizing a least squares problem with constraints on label correlations and feature sparsity.

oCC¹: Optimized Classifier Chains (oCC). The proposed MLC method combining model selection (kCC) and local feature selection (MLFS).

EoCC¹: Ensemble of oCCs (EoCC). The same ensemble strategy as in ECC is applied.

¹ We provide the codes of oCC and EoCC at:

BR was used as the baseline MLC method. As ensemble CC-based methods, ECC and EBCC were introduced due to their superior performance over several MLC decomposition methods, as shown in [5, 7]. MDDM was chosen as a representative of global FS-DR methods, since it outperformed several global FS-DR methods, such as PCA, LPP [39], MLSI [22] and CCA [23], as reported in [12]. As a local FS-DR method, LLSF was chosen due to its performance advantage over another local FS-DR method, LIFT [17], as reported in [16].

In the experiments, 5-fold cross validation was performed to evaluate the classification performance. For a fair comparison, a linear SVM implemented in Liblinear [40] was used as the baseline binary classifier for all the comparing methods. In the parameter settings, we tune the important parameters controlling feature sparsity for the FS-DR methods, i.e., MDDM, LLSF and oCC/EoCC. Specifically, the dimensionality of the feature space in MDDM (thr) and oCC/EoCC (M) is selected from {0.1, 0.3, 0.5, 0.7, 0.9} of the total number d of features, while the value of β in LLSF is tuned in the range {0.001, 0.01, 0.1, 1, 10}. As suggested in [12] and [16], we set the regularization parameter µ = 0.5 in MDDM, and the parameters α = 0.5 and γ = 0.01 in LLSF. To balance the performance of oCC in Exact-Match and Macro/Micro-F1, we set the parameter k = 0.8q. For the ensemble methods, we use an ensemble of 10 CCs/BCCs/oCCs for building ECC/EBCC/EoCC. In order to scale up the ensemble methods, random sampling is applied to select 75% of the instances and 50% of the features for building each single model of the ensemble, as recommended in [5]. All the comparing methods were implemented in Matlab, and the experiments were performed on a computer with an Intel Quad-Core CPU at 3.4 GHz and 4 GB RAM.

8.3. On k-dependence classifier chains

To evaluate the performance of model selection, an experiment was performed with kCC on four datasets, where we increased the number of parents k from 0 to q-1 in steps of 1. For convenience, the values of each metric are normalized by dividing by its maximum. Fig. 3 shows the experimental results in the four metrics averaged by 5-fold cross validation. Consistent with the theoretical analysis, the performance of kCC in Exact-Match improves as the value of k increases, except on the medical dataset.

Moreover, as the value of k approaches 0.8q, the performance becomes stable. In terms of Macro/Micro-F1, the performance shows a tendency similar to Exact-Match, but the changes in values are relatively small. In Hamming-Score, the performance seems insensitive to the change of k. Therefore, it is suggested to set the value of k to 0.8q if the objective is to optimize Exact-Match.

[Figure 3: Performances of kCC on four different datasets (scene, medical, enron, rcv1s) in the four evaluation metrics, with the values of each metric normalized by dividing by its maximum. The number k of parents is increased from 0 to q-1 in steps of 1. The indices of the maximum in Exact-Match over the four datasets are 5, 11, 12 and 88, respectively.]

8.4. On multi-label feature selection

To evaluate the performance of multi-label feature selection, another experiment was performed with five information theoretic feature selection algorithms. The algorithms are generated according to (42) and (43): Multi-Label Mutual Information Maximization (MLMIM) [15] ($\beta = \gamma = 0$ in (42)); Multi-Label Joint Mutual Information (MLJMI) [15] and Multi-Label Max-Relevance Min-Redundancy (MLMRMR) [33] ($\beta = 0$, $\gamma = 1/|X_\theta|$ in (42), where MLMRMR ignores the conditional redundancy term); MLFS_local ($\beta = 1/|\mathrm{Pa}_j|$, $\gamma = 1/|X_{\theta_j}|$ in (43)); and MLFS_global ($\beta = 1/|\mathrm{Pa}_j|$, $\gamma = 1/|X_\theta|$ in (42)). Fig. 4 shows the experimental results on four different datasets in Exact-Match and Macro-F1. We increased the number M of selected features from 10% to 100%, in steps of 10%, of min{d, 200}. The local algorithm, MLFS_local, outperformed the other ones on average, and its performance was more stable with respect to the value of M compared with the other algorithms. Among the global algorithms, MLFS_global worked better than MLMIM, MLMRMR and MLJMI. Note that MLFS_local and MLFS_global achieved a significant advantage with a smaller value of M, probably because of their success in modeling label correlations in (42). Similar patterns can be observed in the other metrics.

[Figure 4: Comparing five information theoretic feature selection algorithms on four datasets (scene, medical, enron, rcv1s1) in Exact-Match (top row) and Macro-F1 (bottom row). The number M of selected features is chosen from 10% to 100%, in steps of 10%, of min{d, 200}.]

In summary, performing local feature selection is more effective than the global way, and it is important to take label correlations into account. Thus, we employ only MLFS_local in the following experiments.

To evaluate the potential of oCC, we further conduct experiments varying its two parameters: the number k of parents and the number M of selected features. The experimental results on the four datasets in Exact-Match and Macro-F1 are shown in Fig. 5. We select the values of k and M according to percentages from 5% to 100% in steps of 5%. In terms of k, we can see that the performance becomes stable when a small percentage of the parents is selected. In terms of M, oCC achieved its best performance when only a fraction of the features are selected. The results verify the existence of useless parents and features.

8.5. On chain order selection

We performed another experiment to evaluate the performance of the chain order selection developed in Algorithm 2. In this experiment, we incorporated Algorithm 2 into oCC (Algorithm 1), and set the two parameters of oCC to k = q and M = d. To simplify the computation of Algorithm 2, we reduced the dimensionality of the features to 1 by Principal Component Analysis (PCA) and ignored the second term of (47). We first compared oCC (with the canonical label order $Y_1 \to Y_2 \to \cdots \to Y_{q-1} \to Y_q$), oCC_cos (with chain order selection) and EoCC_5 (an ensemble of five oCCs with randomly selected orders). As shown in Fig. 6,


More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

RaRE: Social Rank Regulated Large-scale Network Embedding

RaRE: Social Rank Regulated Large-scale Network Embedding RaRE: Social Rank Regulated Large-scale Network Embedding Authors: Yupeng Gu 1, Yizhou Sun 1, Yanen Li 2, Yang Yang 3 04/26/2018 The Web Conference, 2018 1 University of California, Los Angeles 2 Snapchat

More information

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs

CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs Professor Erik Sudderth Brown University Computer Science October 6, 2016 Some figures and materials courtesy

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Dimension Reduction Using Nonnegative Matrix Tri-Factorization in Multi-label Classification

Dimension Reduction Using Nonnegative Matrix Tri-Factorization in Multi-label Classification 250 Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 Dimension Reduction Using Nonnegative Matrix Tri-Factorization in Multi-label Classification Keigo Kimura, Mineichi Kudo and Lu Sun Graduate

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection

Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection Gavin Brown, Adam Pocock, Mingjie Zhao and Mikel Lujan School of Computer Science University of Manchester

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss

Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Regret Analysis for Performance Metrics in Multi-Label Classification: The Case of Hamming and Subset Zero-One Loss Krzysztof Dembczyński 1,3, Willem Waegeman 2, Weiwei Cheng 1, and Eyke Hüllermeier 1

More information

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1 Introduction and Problem Formulation In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1.1 Machine Learning under Covariate

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.

Machine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang. Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning

More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Online Estimation of Discrete Densities using Classifier Chains

Online Estimation of Discrete Densities using Classifier Chains Online Estimation of Discrete Densities using Classifier Chains Michael Geilke 1 and Eibe Frank 2 and Stefan Kramer 1 1 Johannes Gutenberg-Universtität Mainz, Germany {geilke,kramer}@informatik.uni-mainz.de

More information

Feature Selection based on the Local Lift Dependence Scale

Feature Selection based on the Local Lift Dependence Scale Feature Selection based on the Local Lift Dependence Scale Diego Marcondes Adilson Simonis Junior Barrera arxiv:1711.04181v2 [stat.co] 15 Nov 2017 Abstract This paper uses a classical approach to feature

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Adapted Feature Extraction and Its Applications

Adapted Feature Extraction and Its Applications saito@math.ucdavis.edu 1 Adapted Feature Extraction and Its Applications Naoki Saito Department of Mathematics University of California Davis, CA 95616 email: saito@math.ucdavis.edu URL: http://www.math.ucdavis.edu/

More information

Generative MaxEnt Learning for Multiclass Classification

Generative MaxEnt Learning for Multiclass Classification Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

More information

Parameter learning in CRF s

Parameter learning in CRF s Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data

Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data Appendices for the article entitled Semi-supervised multi-class classification problems with scarcity of labelled data Jonathan Ortigosa-Hernández, Iñaki Inza, and Jose A. Lozano Contents 1 Appendix A.

More information

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization

Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Haiping Lu 1 K. N. Plataniotis 1 A. N. Venetsanopoulos 1,2 1 Department of Electrical & Computer Engineering,

More information

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Dimension Reduction (PCA, ICA, CCA, FLD,

Dimension Reduction (PCA, ICA, CCA, FLD, Dimension Reduction (PCA, ICA, CCA, FLD, Topic Models) Yi Zhang 10-701, Machine Learning, Spring 2011 April 6 th, 2011 Parts of the PCA slides are from previous 10-701 lectures 1 Outline Dimension reduction

More information

Clustering with k-means and Gaussian mixture distributions

Clustering with k-means and Gaussian mixture distributions Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2012-2013 Jakob Verbeek, ovember 23, 2012 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.12.13

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information