Multi-dimensional classification via a metric approach


Zhongchen Ma, Songcan Chen (corresponding author: s.chen@nuaa.edu.cn)
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China

Abstract

Multi-dimensional classification (MDC) refers to learning an association between individual inputs and multiple discrete output variables, and is thus more general than multi-class classification (MCC) and multi-label classification (MLC). One of the core goals of MDC is to model the output structure in order to improve classification performance. To this end, one effective strategy is to first transform the output space and then learn in the transformed space. However, existing transformation approaches are all rooted in the label power-set (LP) method and thus inherit its drawbacks (e.g. class imbalance and class overfitting). In this study, we first analyze the drawbacks of LP, and then propose a novel transformation method which not only overcomes these drawbacks but also builds a bridge from MDC to MLC. As a result, many off-the-shelf MLC methods can be adapted to the newly-formed problem. Instead of adapting these methods, however, we propose a novel metric learning based method which yields a closed-form solution for the newly-formed problem. Interestingly, our metric learning based method is also naturally applicable to MLC, and is thus of independent interest as well. Extensive experiments justify the effectiveness of our transformation approach and of our metric learning based method.

Keywords: Multi-dimensional classification, problem transformation, distance metric learning, closed-form solution

Preprint submitted to Journal of LaTeX Templates, September 21, 2017

1. Introduction

In supervised learning, binary classification (BC), multi-label classification (MLC) and multi-class classification (MCC) have been extensively studied in past years. As a more general learning task, MDC is relatively less studied up to now, which can partly be attributed to its more complex output space. Figure 1 displays the relationships among the different classification paradigms in terms of m class variables with K possible values each. As shown in the figure, BC has a single class variable whose range is {0, 1} or {−1, +1}, corresponding to m = 1 and K = 2; MCC also has a single class variable, but one that can take a number of class values, corresponding to m = 1 and K > 2; MLC has multiple class variables whose range is again {0, 1} or {−1, +1}, corresponding to m > 1 and K = 2. More generally, MDC allows multiple class variables that can each take a number of class values, corresponding to m > 1 and K > 2.

Figure 1: Relationship between the different classification paradigms, where m is the number of class variables and K is the number of values each of these variables may take.

A wide range of applications corresponds to this task. For example, in computer vision [1], a landscape image may convey several pieces of information such as the month, the season, or the type of subject; in information retrieval [2][3], documents can be classified into different kinds of categories such as mood or topic; in computational advertising [4], a social media post may reveal the user's gender, age, personality, happiness or political polarity.

Like MLC, the core goal of MDC is to achieve effective classification performance by modeling the output structure. In modeling, the simplest assumption is

that the class variables are completely unrelated, so that it suffices to design a separate, independent model for each class. However, such an ideal assumption is hardly applicable to real-world problems in general, as correlation (structure) often exists among class variables; for example, a user's age can have a strong impact on his political polarity, where the young are generally more radical and elders are often more conservative. Even within each output dimension, there exists an explicit within-dimension relationship among its values, namely that only one value of a class variable can be activated. Therefore, one key to effective learning lies in taking sufficient advantage of the explicit and/or implicit relationships both among output dimensions and among the values within each output dimension.

In order to model such output structures, two main strategies have been proposed: (i) explicitly modeling the dependence structure between class variables, e.g., by imposing a chain structure [5][6][7], using a multi-dimensional Bayesian network structure [8][9], or adopting a Markov random field [10]; (ii) implicitly modeling the output structure through transformation approaches [11][12][13]. A major limitation of the former strategy is that it requires a pre-defined output structure (e.g., a chain or a Bayesian network), and thus partly loses the flexibility to characterize structure. In contrast, the transformation approaches of the latter strategy enjoy more flexibility due to their ability to model various structures. What's more, such a transformation method has demonstrated convincing performance in [13]. Therefore, in this paper, we follow the transformation strategy to model the output structure of MDC.

To the best of our knowledge, all existing transformation methods can be classified as label power-set (LP)-based transformation approaches. LP [11] transforms the MDC problem into a corresponding multi-class classification problem by defining a new compound class variable whose range contains exactly all the possible combinations of values of the original class variables. Though it implicitly considers the interaction between different classes, LP suffers from class imbalance and class overfitting problems, where class imbalance refers to the great differences in the total number of instances for different combinations

of the class variables, and class overfitting refers to zero instances for some combinations of the class variables. To address these issues of LP, [13] proposed to first form super-class partitions by modeling the dependence between class variables and then let each super-class partition correspond to a compound class variable defined by LP. Although this super-class partitioning can reduce the original problem to a set of subproblems, these newly-formed subproblems still need to be transformed by LP, and the approach therefore naturally inherits LP's problems.

In this study, we analyze the drawbacks of LP and propose a novel transformation method which not only overcomes these drawbacks but also builds a bridge from MDC to MLC. Specifically, our transformation approach forms a new output space with all binary variables via a careful binarization of the original output space of MDC. Since the newly-formed problem is similar to MLC (e.g. the class variables of both problems are all binary), our transformation approach is named the Multi-Label-like Transformation approach (MLKT), and subsequently many off-the-shelf MLC methods can be adapted to the newly-formed problem. However, instead of adapting these methods, we also propose a novel metric-based method aiming to make the predictions of an instance in the learned metric space close to its true class values while far away from all others. Moreover, our metric-based method yields a closed-form solution, thus its learning is more efficient than that of its competitors. Interestingly, our metric learning method is also naturally applicable to MLC, and is thus of independent interest as well. Finally, extensive experimental results justify that our approach combining the above two procedures achieves better classification performance than state-of-the-art MDC methods, while our metric learning method itself also obtains competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC.

The rest of the paper is structured as follows. We first introduce the required background in the field of multi-dimensional classification in Section 2. Then we introduce MLKT in Section 3. Next, we present the details of the

distance metric learning method in Section 4. We then experimentally evaluate the proposed schemes in Section 5. Finally, we give concluding remarks in Section 6.

2. Background

In this section, we review basic multi-dimensional classifiers. In MDC, we have N labeled instances D = {(x_i, y_i)}_{i=1}^N from which we wish to build a classifier that associates multiple class values with each data instance. A data instance is represented by a vector of d values x = (x^1, ..., x^d) drawn from the input domain X_1 × ··· × X_d, and the classes are represented by a vector of m values y = (y^1, ..., y^m) from the domain Y_1 × ··· × Y_m, where each Y_j = {1, ..., K_j} is the set of possible values for the j-th class variable Y_j. Specifically, we seek to build a classifier f that assigns each instance x a vector y of class values:

    f : X_1 × ··· × X_d → Y_1 × ··· × Y_m,    x = (x^1, ..., x^d) ↦ y = (y^1, ..., y^m)

Binary Relevance (BR) is a straightforward method for MDC. It trains m classifiers f := (f_1, ..., f_m), one for each class variable. Specifically, a standard multi-class classifier f_j learns to associate one of the values y^j ∈ Y_j with each data instance, where f_j : X_1 × ··· × X_d → Y_j. However, BR is unable to capture the dependencies among classes and suffers low accuracy, as illustrated in [5, 14, 15].
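As a concrete point of reference, a minimal BR baseline can be written as follows. This is only a sketch assuming numpy and scikit-learn with a logistic-regression base learner, not the implementation evaluated later in the experiments.

```python
# Binary Relevance for MDC (sketch): one independent multi-class classifier per output
# dimension, so no output structure is captured.
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevanceMDC:
    def __init__(self):
        self.models_ = []

    def fit(self, X, Y):
        # Train one classifier f_j : X -> Y_j for each of the m class variables.
        self.models_ = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Each dimension is predicted independently of the others.
        return np.column_stack([m.predict(X) for m in self.models_])

# Toy usage with random data (purely illustrative).
X = np.random.randn(100, 5)
Y = np.column_stack([np.random.randint(1, 4, 100), np.random.randint(1, 3, 100)])
print(BinaryRelevanceMDC().fit(X, Y).predict(X[:3]))
```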

MDC has attracted much attention recently, and many multi-dimensional classifiers for modeling its output structure have been proposed. As presented in the introduction, there are two main strategies:

1. Explicit representation of the dependence structure between class variables.

The classifier chains model (CC) [5, 6, 7], classifier trellises (CT) [14] and multi-dimensional Bayesian network classifiers (MBCs) [8][9] are recently proposed methods following this strategy for MDC.

Specifically, the classifier chains model (CC) learns m classifiers, one for each class variable. These classifiers are linked in a random order, such that the j-th classifier uses as input features not only the instance but also the output predictions of the previous j−1 classifiers, namely ŷ^j = f_j(x, ŷ^1, ..., ŷ^{j−1}) for any test instance x. Specifically:

    f_j : X_1 × ··· × X_d × Y_1 × ··· × Y_{j−1} → Y_j

This method has demonstrated high performance in multi-label domains and is directly applicable to MDC. However, a drawback is that the class-variable ordering in the chain has a strong effect on predictive accuracy, and with the greedy structure comes the concern of error propagation along the chain, since an incorrect estimate ŷ^j will negatively affect all subsequent class variables. Naturally, the ensemble strategy (ECC) [7], which trains several CC classifiers with randomly ordered chains, can be used to alleviate these problems. A minimal sketch of the chain idea is given below.

Classifier trellises (CT) capture dependencies among class variables by considering a predefined trellis structure. Each of the vertices of the trellis corresponds to one of the class variables. Fig. 2 shows a simple example of the structure, where the parents of each class variable are the class variables lying on the vertices above and to the left in the trellis. Specifically:

    f_d : X_1 × ··· × X_d × Y_b × Y_c → Y_d

CT can scale to large data sets with reasonable complexity. However, just like CC, the artificially-defined greedy structure may falsely reflect the real dependency among class variables, and thus limits its classification performance in real-world applications unless the predefined structure coincides with the given problem.
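The chain idea referred to above can be sketched as follows, assuming numpy and scikit-learn. This is a simplified illustration rather than the CC implementation used in the experiments.

```python
# Classifier chain for MDC (sketch): the j-th model sees the inputs plus the previous
# class variables, in a fixed (here: given) order.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_chain(X, Y, order=None):
    m = Y.shape[1]
    order = list(range(m)) if order is None else order    # chain order (random in CC)
    models, X_aug = [], X
    for j in order:
        models.append(LogisticRegression(max_iter=1000).fit(X_aug, Y[:, j]))
        X_aug = np.hstack([X_aug, Y[:, [j]]])              # true values during training
    return order, models

def predict_chain(X, order, models):
    X_aug, preds = X, {}
    for j, clf in zip(order, models):
        preds[j] = clf.predict(X_aug)
        # At test time earlier *predictions* are appended; this is where errors propagate.
        X_aug = np.hstack([X_aug, preds[j][:, None]])
    return np.column_stack([preds[j] for j in sorted(preds)])
```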

Figure 2: A simple example of classifier trellises [14].

Multi-dimensional Bayesian network classifiers (MBCs) are a family of probabilistic graphical models which organize the class and feature variables into three different subgraphs: a class subgraph, a feature subgraph, and a bridge (from classes to features) subgraph. Different graphical structures for the class and the feature subgraphs lead to different families of MBCs. We show a simple tree-tree structure of an MBC in Fig. 3. In recent years, various MBCs have been proposed and have become useful tools for modeling the output structure of MDC [8, 9, 16]. However, their problem remains the exponential computational complexity.

Figure 3: A simple tree-tree structure of an MBC [8][9].

2. Implicit incorporation of output structure by transforming the output space.

To the best of our knowledge, all existing transformation methods for MDC can be classified as label power-set (LP)-based transformation approaches. The label power-set (LP) [11] is a typical transformation approach for MLC and can also be directly applied to MDC. It first forcefully assumes that all the class variables are dependent, and then defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. Specifically,

    f : X_1 × ··· × X_d → CartesianProduct(Y_1, ..., Y_m).
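A minimal sketch of the LP transformation, assuming numpy and scikit-learn (not the implementation used in the experiments):

```python
# Label power-set (LP) transformation (sketch): each observed combination of class
# values becomes one compound class, and a single multi-class model is trained on it.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lp_fit(X, Y):
    combos, compound = np.unique(Y, axis=0, return_inverse=True)  # one class per combination
    clf = LogisticRegression(max_iter=1000).fit(X, compound)
    return clf, combos

def lp_predict(clf, combos, X):
    return combos[clf.predict(X)]      # map the compound class back to a value vector
```

Note that only combinations seen in training receive a compound class, and rare combinations get very few instances, which is exactly the class overfitting and class imbalance issue analyzed in Section 3.1.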

As a result, the original problem is turned into a multi-class classification problem for which many off-the-shelf methods are available. In this way, the output structure of MDC is implicitly considered. However, LP easily suffers from the class overfitting and class imbalance problems mentioned in the introduction.

Random k-labelsets (RAkEL) [17] and the super-class classifier (SCC) [13] are LP-based transformation approaches, where RAkEL uses multiple LP classifiers, each trained on a random subset of Y, while SCC's LP classifiers are trained on subsets of class variables with strong dependency. Formally, given a subset S of the class variables of Y, its corresponding LP classifier is learned, namely:

    f_S : X_1 × ··· × X_d → CartesianProduct(Y_S).

Note that RAkEL is specially designed for MLC, but can be directly applied to MDC, while SCC has demonstrated convincing classification performance for MDC. However, both methods need to resort to LP, and therefore naturally suffer from the same problems to some extent.

In summary, the above two strategies have their own advantages and drawbacks. Relatively speaking, however, the latter enjoys more flexibility in modeling output structure. Therefore, in this paper, we follow the latter strategy to model the output structure of MDC. Unfortunately, existing transformation approaches are all based on LP and thus naturally inherit its class overfitting and class imbalance problems. Motivated by this, we give an analysis of the drawbacks of LP and propose a novel transformation approach to overcome them.

3. Transformation for MDC

3.1. Analysis of LP

We now give an analysis of LP and detail the causes of its drawbacks.

LP is a typical transformation approach for MDC. It defines a new compound class variable whose range contains all the possible combinations of values of the original class variables. In this way, the original MDC problem is turned into a multi-class classification problem which is relatively easier for subsequent learning. However, such a transformation has two serious drawbacks.

The first drawback is class overfitting, which is caused by the great reduction of the number of instances per class after the transformation. Specifically, consider a dataset {x_i, y_i}_{i=1}^N of an MDC problem with m class variables, each taking K_i class values. Clearly, the number of instances per class in the i-th output dimension is about (1/K_i)·N in the original problem, and in the worst case it is (1/K_max)·N, where K_max = max_i(K_i). After the transformation, however, the number of instances per class becomes (1/∏_{i=1}^m K_i)·N, which is far less than (1/K_max)·N since ∏_{i=1}^m K_i is typically far greater than K_max. Hence, this reduction makes learning on the formed problem prone to class overfitting.

The second drawback is class imbalance, which is caused by the reduction of the balance degree (the smallest ratio between the numbers of instances of any two classes) after the transformation. More specifically, given a dataset of an imbalanced MDC problem with m class variables Y_1, ..., Y_m, assume the balance degrees of the class variables are respectively p_1, ..., p_m. After the LP transformation, the balance degree changes to p_1···p_m. Since p_i < 1 for all i ∈ {1, ..., m}, we get p_1···p_m < min(p_1, ..., p_m). Thus the balance degree in the formed problem is worse than that in the original problem. In essence, although LP does not transform a totally balanced MDC problem into an imbalanced one, it can indeed transform an almost balanced MDC problem into an imbalanced one.

What's more, from the above analysis we find that the more class variables the LP compound class variable includes, the more serious the class overfitting and class imbalance problems become.
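A small numerical illustration of the two effects just described, with hypothetical values N = 1000 and K = (3, 4, 5) that are not taken from any dataset in the paper:

```python
# Counting argument behind the two LP drawbacks (illustrative numbers only).
import math

N, K = 1000, [3, 4, 5]
per_dim_counts = [N / k for k in K]        # about N / K_i instances per class per dimension
per_lp_count = N / math.prod(K)            # only N / (K_1 * ... * K_m) per LP compound class
print(per_dim_counts, per_lp_count)        # [333.3, 250.0, 200.0] vs 16.7

balance = [0.9, 0.8, 0.7]                  # assumed per-dimension balance degrees p_i
print(math.prod(balance), min(balance))    # 0.504 < 0.7: LP worsens the balance degree
```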

To overcome the drawbacks of LP, we propose a transformation approach that aims to 1) keep the number of instances per class in the transformed problem as similar as possible to that in the original MDC problem, and 2) keep the balance degree in the formed problem as consistent as possible with that in the original MDC problem.

3.2. The procedure of MLKT

In this subsection, we present a novel transformation approach, namely MLKT, which transforms Y to a subspace of {0, 1}^L (where L is the dimensionality of the formed problem). Our approach inherits the following favorable characteristics of LP:

- it keeps the output space size invariant;
- it is easy for subsequent modeling in the transformed space;
- it can reflect the explicit within-dimension relationship.

In addition, it possesses two extra key characteristics:

- it can overcome the class overfitting and class imbalance problems suffered by LP;
- it is decomposable for each class variable of MDC.

Among the above characteristics, we deliberately make the transformation decomposable, aiming at easy implementation of the transformation and distinct learning for each output variable. By doing so, we avoid the unnecessary computational cost that LP incurs when the correlations between some output variables are not strong.

Now let us detail the procedure of MLKT. For each individual class variable of MDC:

- if K_i ≥ 3, then for each y^i ∈ Y_i = {1, ..., K_i}, construct a new K_i-dimensional class vector ẑ^i where ẑ^i_j = 1 if j = y^i, and 0 otherwise;
- if K_i = 2, then for each y^i ∈ Y_i = {1, 2}, construct a 1-dimensional class vector ẑ^i where ẑ^i = 0 if y^i = 1 and ẑ^i = 1 if y^i = 2.
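The construction above is mechanical; a minimal sketch assuming numpy is given below (the toy usage matches Example 1, given next).

```python
# MLKT transformation (sketch): one-hot encode every class variable with K_i >= 3,
# keep a single 0/1 bit for binary class variables, then concatenate the blocks.
import numpy as np

def mlkt_transform(Y, K):
    """Y: (N, m) integer matrix with Y[:, i] in {1, ..., K[i]}; returns the (N, L) binary Z."""
    blocks = []
    for i, Ki in enumerate(K):
        if Ki >= 3:
            block = np.zeros((Y.shape[0], Ki), dtype=int)
            block[np.arange(Y.shape[0]), Y[:, i] - 1] = 1      # one-vs-all coding
        else:
            block = (Y[:, [i]] == 2).astype(int)               # {1, 2} -> {0, 1}
        blocks.append(block)
    return np.hstack(blocks)

# Toy usage: Y1 in {1, 2, 3}, Y2 in {1, 2}.
Y = np.array([[1, 1], [2, 2], [3, 1]])
print(mlkt_transform(Y, K=[3, 2]))
# [[1 0 0 0]
#  [0 1 0 1]
#  [0 0 1 0]]
```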

With the above transformation, the original output class vector y = (y^1, ..., y^m) ∈ Y is converted to a corresponding class vector ẑ = [ẑ^1; ...; ẑ^m], i.e. ẑ is obtained by concatenating the corresponding m vectors in ascending index order. Thus, we form a new output domain Ẑ = {ẑ_y | y ∈ Y}, where ẑ_y denotes the vector obtained from y by the MLKT transformation. Clearly, Ẑ ⊆ {0, 1}^L, where L is the dimensionality of ẑ. Let us give an example to help understand the MLKT transformation.

Example 1. Assume the output space of MDC is Y = Y_1 × Y_2, where Y_1 := {1, 2, 3} and Y_2 := {1, 2}.

1. Transformation of each individual class domain:
   Y_1 := {1, 2, 3} → {(1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T},
   Y_2 := {1, 2} → {0, 1}.

2. Concatenation:
   Ẑ = {(1, 0, 0, 0)^T, (1, 0, 0, 1)^T, (0, 1, 0, 0)^T, (0, 1, 0, 1)^T, (0, 0, 1, 0)^T, (0, 0, 1, 1)^T}.

Clearly, it is not easy to learn directly in the newly-formed space. However, we observe that the newly-formed space is equivalent to {0, 1}^L with some additionally-imposed constraints, which is relatively easier for subsequent learning. These additionally-imposed constraints can be obtained from the following insight. Let us define an integer set φ^i = {φ^i_1, φ^i_2, ..., φ^i_j, ...} for each i = 1, ..., m, such that ẑ_(φ^i) = ẑ^i, where the elements of φ^i are the indices of ẑ corresponding to the class variable domain Y_i. Based on the transformation characteristics of MLKT, we find that when K_i ≥ 3, the vector ẑ_(φ^i) has one and only one element equal to 1; thus we have

    Σ_{j ∈ φ^i} ẑ_(φ^i_j) = 1,   ∀ẑ ∈ Ẑ.    (1)

In fact, we can use E^T z = t to formulate all these equalities, where E ∈ R^{L×m} is an indicator matrix whose elements are

    E_{ki} = 1 if k ∈ φ^i and K_i ≥ 3, and E_{ki} = 0 otherwise,    (2)

and t is an m-dimensional vector whose elements are

    t_i = 1 if K_i ≥ 3, and t_i = 0 otherwise.    (3)

Next, we prove in Proposition 1 that the output domain Ẑ of the newly-formed problem is in fact equivalent to

    Z = {z ∈ {0, 1}^L | E^T z = t},    (4)

where E and t are defined in Eq. (2) and Eq. (3), respectively.

Proposition 1. The output domain Ẑ formed by MLKT is equivalent to Z = {z ∈ {0, 1}^L | E^T z = t}, where E is the predefined indicator matrix and t is the predefined m-dimensional vector.

Proof. Based on Eq. (1), every ẑ ∈ Ẑ also belongs to Z. Therefore, we just need to prove that the size of Z is consistent with that of Y (which implies consistency with that of Ẑ as well). Assume the original MDC problem has m class variables, with K_1, K_2, ..., K_m possible values respectively. Since the MLKT transformation is decomposable for each class variable, we just need to prove that the size of Z_(φ^i) is also K_i. We give the proof by cases.

Case 1: if K_i = 2, the class variable domain Y_i is transformed to Z_(φ^i) = {0, 1}^1, so its size is also 2.

Case 2: if K_i ≥ 3, the class variable domain Y_i is transformed to Z_(φ^i) = { z_(φ^i) ∈ {0, 1}^{K_i} | ⟨1, z_(φ^i)⟩ = 1 } in terms of our setup in Eq. (1), where 1 is the K_i-dimensional all-ones vector and ⟨·, ·⟩ denotes the inner product of two vectors. Because the equality constraint ensures that the vector z_(φ^i) has one and only one element equal to 1, the size of Z_(φ^i) is also K_i.

To further help understand the MLKT transformation from Y to Z = {z ∈ {0, 1}^L | E^T z = t}, we give the following example.

Example 2. Assume the output space of MDC is Y = Y_1 × Y_2, where Y_1 := {1, 2, 3} and Y_2 := {1, 2}. The MLKT approach follows two steps:

1. Define φ^i and L.
2. Define the indicator matrix E and the vector t.

According to Eq. (4), we get φ^1 = {1, 2, 3}, φ^2 = {4}, L = 4, E_1 = (1, 1, 1, 0)^T, E_2 = (0, 0, 0, 0)^T, E = [E_1, E_2] and t = [1, 0]^T, respectively. Thus, Z can be defined as Z = {z ∈ {0, 1}^4 | E^T z = t}.

Under such a transformation, the number of instances per class in the formed problem appears to be N/2, which is far more than that in the problem formed by LP, i.e., (1/∏_{i=1}^m K_i)·N. Moreover, although on the surface it also appears to be more than that in the original problem, i.e., (1/K_i)·N, this is not actually the case. By the decomposability of MLKT and the consistency of the size of Z_(φ^i) with that of Y_i, if the original MDC problem has N_i instances categorized as y^i, then the formed problem also has N_i instances categorized as z_(φ^i), meaning that the formed problem keeps consistent with the original MDC problem in both the number of instances per class and the balance degree. Naturally, MLKT avoids the class overfitting and class imbalance problems. Moreover, the explicit within-dimension relationship is reflected by the commonly-used one-vs-all coding [18]. In this way, MLKT guarantees all the desired transformation characteristics. From now on, we just need to focus on learning the structure of the problem formed by MLKT in order to reveal the original structure.
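The indicator matrix E and vector t of Eqs. (2)-(4) can be built mechanically from (K_1, ..., K_m). The following is a minimal sketch assuming numpy, with Example 2's setting used to check membership in Z.

```python
# Construct E and t (Eqs. (2)-(3)) and verify E^T z = t for an MLKT-encoded vector.
import numpy as np

def build_E_t(K):
    L = sum(Ki if Ki >= 3 else 1 for Ki in K)
    E = np.zeros((L, len(K)), dtype=int)
    t = np.zeros(len(K), dtype=int)
    pos = 0
    for i, Ki in enumerate(K):
        width = Ki if Ki >= 3 else 1
        if Ki >= 3:                      # only non-binary dimensions impose a sum-to-one constraint
            E[pos:pos + width, i] = 1
            t[i] = 1
        pos += width
    return E, t

E, t = build_E_t([3, 2])                 # Example 2's setting
z = np.array([0, 1, 0, 1])               # the MLKT code of y = (2, 2)
print(np.array_equal(E.T @ z, t))        # True: z lies in Z = {z in {0,1}^L | E^T z = t}
```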

4. Learning for the transformed problem

Notations. First, we give the notation used in the sequel. For a matrix A ∈ R^{p×q}, ∥A∥_F = sqrt(Σ_i Σ_j A_{ij}^2) denotes its Frobenius norm. For a positive definite matrix B ≻ 0, B^{-1} denotes its inverse. And we use ∥x_1 − x_2∥_B^2 = (x_1 − x_2)^T B (x_1 − x_2) to denote the (squared) Mahalanobis distance between vectors x_1 and x_2.

4.1. Model construction and optimization

Given the training instances D = {(x_i, y_i)}_{i=1}^N (y_i ∈ Y), we obtain the corresponding re-labeled instances D' = {(x_i, z_i)}_{i=1}^N (z_i ∈ Z) by the MLKT transformation. The remaining problem is then to build a classifier g that assigns each instance x a vector z of class values:

    x_i = (x_i^1, ..., x_i^d) ↦ z_i = (z_i^1, ..., z_i^L)

Let X ∈ R^{N×d} denote the input matrix and Z ∈ {0, 1}^{N×L} the output matrix whose rows are the transformed class vectors z_i. To solve the transformed problem, a simple linear regression model learns the matrix P through the following formulation:

    argmin_{P ∈ R^{d×L}}  (1/2) ∥Z − XP∥_F^2 + γ ∥P∥_F^2,    (5)

where γ ≥ 0 is a regularization parameter. However, this method usually yields low classification performance because it does not take the correlations in the output space into account [19]. To consider these correlations, [20, 19] proposed to learn a discriminative Mahalanobis distance metric which makes the distance between P^T x_i and z_i smaller than that between P^T x_i and any other output z in the output space. Unfortunately, neither [20] nor [19] is directly applicable to our transformed problem; we instead develop an alternative metric learning method well suited to our scenario, which nicely admits a closed-form solution (our Mahalanobis metric learning method is similar to [20, 19]; we detail the connections in Section 4.3). Its formulation is as follows:

    argmin_{Ω ≻ 0}  Σ_i ∥P^T x_i − z_i∥²_Ω + Σ_i Σ_{z ∈ Z\z_i} (1/|Z\z_i|) ∥P^T x_i − z∥²_{Ω^{-1}},    (6)

where P is the solution of the linear regression model (5) and Ω is a positive definite matrix. In the above, the first term aims to make the distance between P^T x_i and z_i smaller, and the second term makes the distance between P^T x_i and any other output z larger. The main idea of using Ω^{-1} is motivated by [21], where Ω^{-1} is used to measure the distances between dissimilar points. The goal is to increase

the Mahalanobis distance (under Ω) between P^T x_i and any other output z by decreasing ∥P^T x_i − z∥²_{Ω^{-1}} (see Proposition 1 of [21]).

Because the size of Z grows exponentially with the dimension L, we only consider the k nearest neighbors (kNN) of z_i among the training outputs instead of all other outputs in the whole output space. Moreover, a regularization term is used to avoid overfitting. We therefore formulate the distance metric learning method as follows:

    argmin_{Ω ≻ 0}  λ D_sld(Ω, I) + Σ_i ∥P^T x_i − z_i∥²_Ω + Σ_i Σ_{z ∈ kNN(z_i)\z_i} (1/k) ∥P^T x_i − z∥²_{Ω^{-1}},    (7)

where λ ≥ 0, P is fixed to the solution of (5), I is the identity matrix, and D_sld(Ω, I) is the symmetrized LogDet divergence D_sld(Ω, I) := tr(Ω) + tr(Ω^{-1}) − 2L. Further define

    S := Σ_i (P^T x_i − z_i)(P^T x_i − z_i)^T,    (8)

    D := Σ_i Σ_{z ∈ kNN(z_i)\z_i} (1/k) (P^T x_i − z)(P^T x_i − z)^T.    (9)

Using both S and D, the minimization problem (7) can be recast as

    argmin_{Ω ≻ 0}  λ D_sld(Ω, I) + tr(ΩS) + tr(Ω^{-1}D).    (10)

Interestingly, the minimization problem (10) is the same as problem (13) of [21], and is both strictly convex and strictly geodesically convex (Theorem 3 of [21]), thus having a global optimal solution. What's more, it has the closed-form solution

    Ω = (S + λI)^{-1} ♯_{1/2} (D + λI),    (11)

where A ♯_{1/2} B := A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}.
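For concreteness, the quantities S and D and the closed-form metric of Eq. (11) can be computed along the following lines. This is a sketch assuming numpy, scipy and scikit-learn, not the authors' implementation; the general exponent t in the helper also gives the weighted mean used later in Eq. (14).

```python
# Sketch of the core computation: ridge solution of Eq. (5), S and D of Eqs. (8)-(9),
# and the geometric-mean metric of Eq. (11).
import numpy as np
from scipy.linalg import sqrtm, fractional_matrix_power, inv
from sklearn.neighbors import NearestNeighbors

def ridge_P(X, Z, gamma):
    # Closed form of Eq. (5): minimize 0.5*||Z - XP||_F^2 + gamma*||P||_F^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * gamma * np.eye(d), X.T @ Z)

def build_S_D(X, Z, P, k):
    R = X @ P                                   # row i is (P^T x_i)^T
    S = (R - Z).T @ (R - Z)                     # Eq. (8)
    idx = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z, return_distance=False)
    D = np.zeros((Z.shape[1], Z.shape[1]))
    for i in range(Z.shape[0]):
        for j in idx[i, 1:]:                    # k nearest codes other than z_i (assumes
            diff = R[i] - Z[j]                  # the closest neighbour is z_i itself)
            D += np.outer(diff, diff) / k       # Eq. (9)
    return S, D

def geometric_mean(A, B, t=0.5):
    # A #_t B = A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}
    A_half = sqrtm(A)
    A_half_inv = inv(A_half)
    return np.real(A_half @ fractional_matrix_power(A_half_inv @ B @ A_half_inv, t) @ A_half)

def gmml_metric(S, D, lam, t=0.5):
    L = S.shape[0]
    return geometric_mean(inv(S + lam * np.eye(L)), D + lam * np.eye(L), t)   # Eq. (11)
```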

In fact, this solution is the midpoint of the geodesic joining (S + λI)^{-1} and (D + λI). The geodesic viewpoint is important for trading off between (S + λI)^{-1} and (D + λI). Note that Ω := (S + λI)^{-1} ♯_{1/2} (D + λI) is also the minimizer of the following problem (12), according to [21]:

    argmin_{Ω ≻ 0}  δ_R^2(Ω, (S + λI)^{-1}) + δ_R^2(Ω, (D + λI)),    (12)

where δ_R denotes the Riemannian distance δ_R(U, V) := ∥log(V^{-1/2} U V^{-1/2})∥_F for U, V ≻ 0. Thus, we can obtain a balanced version of problem (10) between S and D:

    argmin_{Ω ≻ 0}  (1 − t) δ_R^2(Ω, (S + λI)^{-1}) + t δ_R^2(Ω, (D + λI)),   t ∈ [0, 1].    (13)

Interestingly, it can be shown (see [22], Ch. 6) that the unique solution to problem (13) is

    Ω = (S + λI)^{-1} ♯_t (D + λI),    (14)

where A ♯_t B := A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}. The solution connects to the Riemannian geometry of symmetric positive definite (SPD) matrices, and we therefore denote it as gMML. We detail the whole learning procedure in Algorithm 1.

Algorithm 1: MLKT-gMML
Input: the MDC training set D = {(x_i, y_i)}_{i=1}^N (y_i ∈ Y); the preset hyper-parameters k, λ, γ and t.
Output: the regression matrix P and the distance metric Ω.
1: Transform D to D' by the MLKT approach: D' = {(x_i, z_i)}_{i=1}^N (z_i ∈ Z).
2: Set P := argmin_{P ∈ R^{d×L}} (1/2) ∥Z − XP∥_F^2 + γ ∥P∥_F^2.
3: Compute S and D by Eq. (8) and Eq. (9).
4: Set Ω := (S + λI)^{-1} ♯_t (D + λI).
5: Return P and Ω.

Note that P has an impact on learning Ω and, conversely, Ω has an impact on learning P as well. Thus, P can be obtained by optimizing the following problem:

    P := argmin_{P ∈ R^{d×L}}  (1/2) ∥Z − XP∥²_Ω + γ ∥P∥_F^2.    (15)

Its solution boils down to solving the Sylvester equation (X^T X) P + γ P Ω^{-1} = X^T Z. A classical algorithm for solving such an equation is the Bartels-Stewart algorithm [23]. In a nutshell, an iterative algorithm for learning P and Ω, called gMML-I, is detailed in Algorithm 2.

Algorithm 2: gMML-I
Input: the MDC training set D = {(x_i, y_i)}_{i=1}^N (y_i ∈ Y); the number of iterations η; the preset hyper-parameters k, λ, γ and t.
Output: the regression matrix P and the distance metric Ω.
1: Transform D to D' by the MLKT approach: D' = {(x_i, z_i)}_{i=1}^N (z_i ∈ Z).
2: Set Ω_init = I.
3: Repeat:
4:   Set P := argmin_{P ∈ R^{d×L}} (1/2) ∥Z − XP∥²_Ω + γ ∥P∥_F^2.
5:   Compute S and D by Eq. (8) and Eq. (9).
6:   Set Ω := (S + λI)^{-1} ♯_t (D + λI).
7: Until η iterations are reached.
8: Return P and Ω.
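The P-update of step 4 in Algorithm 2 can be carried out with a standard Sylvester-equation solver; below is a minimal sketch assuming numpy and scipy (scipy.linalg.solve_sylvester implements the Bartels-Stewart algorithm mentioned above).

```python
# One P-update of gMML-I (Eq. (15)): the stationarity condition stated in the text,
# (X^T X) P + gamma * P * Omega^{-1} = X^T Z, is a Sylvester equation A P + P B = Q.
import numpy as np
from scipy.linalg import solve_sylvester, inv

def update_P(X, Z, Omega, gamma):
    A = X.T @ X                 # left coefficient
    B = gamma * inv(Omega)      # right coefficient
    Q = X.T @ Z                 # right-hand side
    return solve_sylvester(A, B, Q)
```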

4.2. Prediction for a new instance

Based on the learned model, the output z of a new instance x can be predicted by solving the following optimization problem:

    min_{z ∈ Z}  (1/2) ∥z − P^T x∥²_Ω.    (16)

This is equivalent to a binary quadratic optimization problem with equality constraints, namely

    min_z  (1/2) ∥z − P^T x∥²_Ω   s.t.  E^T z = t,  z ∈ {0, 1}^L.    (17)

The optimization problem (17) is very difficult to solve due to its NP-hardness. Instead, we replace the binary constraints with 0 ≤ z ≤ 1, so that the NP-hard optimization problem is converted into a simple box-constrained quadratic program:

    min_v  (1/2) ∥v − P^T x∥²_Ω   s.t.  E^T v = t,  v ∈ [0, 1]^L.    (18)

Now, for each index set φ^i, the prediction z_(φ^i) for x is made as

    if |φ^i| ≥ 3:  z_(φ^i_j) = 1 if j = argmax_k v_(φ^i_k) (k = 1, ..., K_i), and 0 otherwise;
    if |φ^i| = 1:  z_(φ^i) = round(v_(φ^i)),    (19)

where round(·) rounds the prediction into a 0/1 assignment. In turn, the prediction y^i for x in the original output space is

    y^i = j if |φ^i| ≥ 3,   and   y^i = z_(φ^i) + 1 if |φ^i| = 1.    (20)

Algorithm 3 details the prediction procedure.

Algorithm 3: Predict a new instance x
Input: the learned regression matrix P and distance metric Ω; the new instance x.
Output: the predicted class vector y.
1: Solve z := argmin_{z ∈ Z} ∥z − P^T x∥²_Ω (via the relaxation (18)).
2: Inverse transformation y ← z according to Eq. (19) and Eq. (20).
3: Return y.
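A possible realization of Algorithm 3, i.e. the relaxation (18) followed by the rounding of Eqs. (19)-(20), is sketched below. It assumes numpy and scipy and uses a general-purpose SLSQP solver rather than a dedicated QP solver.

```python
# Relaxed decoding (Eq. (18)) and inverse MLKT mapping (Eqs. (19)-(20)), as a sketch.
import numpy as np
from scipy.optimize import minimize

def decode(x, P, Omega, E, t, K):
    r = P.T @ x                                        # P^T x, the point to decode
    obj  = lambda v: 0.5 * (v - r) @ Omega @ (v - r)   # 0.5 * ||v - P^T x||_Omega^2
    grad = lambda v: Omega @ (v - r)
    cons = {'type': 'eq', 'fun': lambda v: E.T @ v - t}
    res = minimize(obj, np.clip(r, 0, 1), jac=grad,
                   bounds=[(0.0, 1.0)] * r.size, constraints=[cons], method='SLSQP')
    v, y, pos = res.x, [], 0
    for Ki in K:                                       # invert MLKT block by block
        width = Ki if Ki >= 3 else 1
        block = v[pos:pos + width]
        if Ki >= 3:
            y.append(int(np.argmax(block)) + 1)        # Eqs. (19)-(20): index of the active bit
        else:
            y.append(int(round(block[0])) + 1)         # Eqs. (19)-(20): {0, 1} -> {1, 2}
        pos += width
    return np.array(y)
```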

4.3. Connections between existing metric learning methods and ours

Two works are most closely related to our metric learning method, namely maximum margin output coding (MMOC) [20] and large margin metric learning with kNN constraints (LM-kNN) [19]. These two methods likewise use a Mahalanobis distance metric (a symmetric positive semidefinite matrix, Ω ∈ S_+) to model the output structure of MLC, where the Mahalanobis distance metric is used to learn a lower-dimensional space.

MMOC aims to learn a discriminative Mahalanobis metric which makes the distance between P^T x_i and its real class vector z_i as close to 0 as possible and smaller than the distance between P^T x_i and any other output by some margin. Specifically, its formulation is as follows:

    argmin_{Ω ∈ S_+, {ξ_i}_{i=1}^n}  (1/2) trace(Ω) + (C/n) Σ_{i=1}^n ξ_i
    s.t.  φ_{iz_i}^T Ω φ_{iz_i} + Δ(z_i, z) − ξ_i ≤ φ_{iz}^T Ω φ_{iz},   ∀z ∈ {0, 1}^L, ∀i,    (21)

where C is a positive constant, φ_{iz} = P^T x_i − z and φ_{iz_i} = P^T x_i − z_i. MMOC has proved to give good classification accuracy for the MLC task. However, it also carries a heavy burden: it has to handle an exponentially large number of constraints for each instance during training, which leads to computational infeasibility.

Like MMOC, LM-kNN also adopts a Mahalanobis metric learning method for MLC, but it involves only k constraints for each instance. Its distance metric learning attempts to bring instances with similar class vectors closer, so that the class vector of each instance can be predicted from its nearest neighbors. In fact, LM-kNN is much simpler than MMOC and is established by minimizing the following objective:

    argmin_{Ω ∈ S_+, {ξ_i}_{i=1}^n}  (1/2) trace(Ω) + (C/n) Σ_{i=1}^n ξ_i
    s.t.  φ_{iz_i}^T Ω φ_{iz_i} + Δ(z_i, z) − ξ_i ≤ φ_{iz}^T Ω φ_{iz},   ∀z ∈ Nei(i), ∀i,    (22)

where C, φ_{iz} and φ_{iz_i} are defined as in MMOC, and Nei(i) is the set of the k nearest neighboring outputs of input x_i.

For LM-kNN, the prediction for a test instance is obtained from its k nearest neighbors in the learned metric space. Specifically, for the test

input x, we find its k nearest instances {x_1, ..., x_k} in the training set; then a set of scores for each candidate class vector of x is obtained from the distances between x and {x_1, ..., x_k}; lastly, these scores are used to predict its class vector by thresholding.

Clearly, neither MMOC nor LM-kNN can be applied to our transformed problem, because the output space of our transformed problem is not equivalent to that of MLC. Although they could be adapted to our scenario with some effort, this effort is non-trivial because their training and/or prediction would have to be re-designed; moreover, it is not our present focus. As a result, we choose an alternative design for our Mahalanobis distance metric learning, where our method is formally close to MMOC but has a closed-form solution, as described in Section 4.1.

4.4. Complexity analysis

The time complexity of the regularized least squares regression is basically the complexity of the matrix multiplications, O(Nd² + NdL), plus the complexity of the matrix inversion, O(d³). The complexity of computing the geometric mean of two matrices by the Cholesky-Schur method [24] is O(L³). The complexity of solving a Sylvester equation is O(d³ + L³ + Nd² + NdL). The complexity of solving a box-constrained quadratic program is O(L³ + Ld). And the time complexity of kNN is O(kN).

Algorithm 1 involves solving a regularized least squares regression problem and computing the geometric mean of two matrices. Therefore, its total time complexity is O(Nd² + NdL + d³ + L³ + kN). Algorithm 2 involves solving a Sylvester equation and computing the geometric mean of two matrices over η iterations (where η is usually fixed to a preset small integer). Therefore, its time complexity is O(η(d³ + L³ + Nd² + NdL + kN)). Algorithm 3 involves solving a box-constrained quadratic program, so its time complexity is O(L³ + Ld).

Based on [19], the training and prediction time complexities of LM-kNN are respectively O((1/ε)(Nd² + NdL + L³ + d³ + kNdL²)) and O(LN + Ld), where ε is the accuracy reached by its solution. The training and prediction time complexities

of MMOC are respectively O(θ(Nd² + NdL + d³ + NL³ + N⁴)) and O(L³), where θ is its number of iterations. In comparison with the metric learning counterparts used here, our Algorithm 1 and Algorithm 2 have an advantage over MMOC and LM-kNN in terms of training time, since η is much smaller than both 1/ε and θ. The prediction time complexity of our Algorithm 3 is comparable to that of MMOC and higher than that of LM-kNN.

5. Experiments

In this section, we discuss the experiments conducted on two publicly available real-world MDC datasets, ImageCLEF2014 and Bridges. ImageCLEF2014 comes from a real-world challenge in the field of robot vision [25], and the Bridges dataset comes from the UCI collection [26]. Unfortunately, there are not yet many publicly available standardized multi-dimensional datasets, so we augment our collection with the eight most commonly used multi-label datasets, which can be accessed from the Mulan repository. The characteristics of these datasets are shown in Table 1.

Table 1: Datasets used in the evaluation (dataset, number of class variables, number of features, number of instances): birds, emotions, medical, scene, yeast, flags, genbase, CAL500, bridges, ImageCLEF2014.

We consider two commonly used evaluation criteria for MDC, namely Hamming accuracy and Example accuracy.

Table 2: Hamming Accuracy (Part A) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

These evaluation criteria are calculated as follows:

1. Hamming accuracy:

    Acc = (1/m) Σ_{j=1}^m Acc_j = (1/m) Σ_{j=1}^m (1/N) Σ_{i=1}^N δ(y_i^j, ŷ_i^j),

where δ(y_i^j, ŷ_i^j) = 1 if ŷ_i^j = y_i^j, and 0 otherwise. Here ŷ_i^j denotes the j-th class value predicted by the classifier for instance i, and y_i^j is its true value.

2. Example accuracy:

    Acc = (1/N) Σ_{i=1}^N δ(y_i, ŷ_i),

where δ(y_i, ŷ_i) = 1 if ŷ_i = y_i, and 0 otherwise.

Before the experiments, some parameters need to be set in advance. The parameter η of the gMML-I algorithm is always set to 3 throughout our experiments (because when η > 3, we observe no further changes in Ω and P). The parameters λ and t associated with Ω are tuned over the ranges {10^0, 10^1, 10^2} and {0.3, 0.5, 0.7}, respectively. The parameter γ for P is tuned over the range {0, 0.1, 0.2}. All the following experimental results are the averages of 10-fold cross-validation experiments, and the best result on each dataset is marked.

5.1. Comparison with our baseline methods

We first verify the classification accuracy of MLKT-gMML-I in comparison with both the ridge regression model (namely Ω = I, denoted MLKT-RR) and the algorithm without the iteration procedure (namely MLKT-gMML). The results are shown in Tables 2, 3, 4 and 5.
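For reference, the two evaluation criteria defined above can be computed as follows (a small sketch assuming numpy; the toy arrays are purely illustrative).

```python
# Hamming accuracy and Example accuracy for MDC predictions: per-dimension 0/1
# agreement averaged over dimensions vs. exact match of the whole class vector.
import numpy as np

def hamming_accuracy(Y_true, Y_pred):
    return float(np.mean(Y_true == Y_pred))

def example_accuracy(Y_true, Y_pred):
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))

Y_true = np.array([[1, 2], [3, 1], [2, 2]])
Y_pred = np.array([[1, 2], [3, 2], [1, 2]])
print(hamming_accuracy(Y_true, Y_pred), example_accuracy(Y_true, Y_pred))  # 0.667, 0.333
```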

Table 3: Hamming Accuracy (Part B) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.

Table 4: Example Accuracy (Part A) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

From the results, we can see that MLKT-gMML-I achieves nearly the best accuracy on all these datasets with respect to both evaluation criteria. To verify whether the differences are significant, two non-parametric Friedman tests among these methods are conducted, for Hamming accuracy and for Example accuracy respectively.

For Hamming accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 2, 18) = 3.555 (here, F is the percent point function of the F distribution, α is the significance level, b is the number of datasets and k is the number of algorithms under test). Thus, the null hypothesis that all the methods have identical effects is rejected, and a post-hoc test needs to be conducted to further examine their differences. To this end, a commonly-used post-hoc test, the Nemenyi test, is conducted. The result is shown in Figure 4, from which we can see that: 1) MLKT-gMML-I is significantly different from our other two methods; 2) MLKT-gMML achieves a performance comparable to MLKT-RR.

Table 5: Example Accuracy (Part B) of MLKT-RR, MLKT-gMML and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.
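The significance analysis used in this and the following subsections can be reproduced along these lines (a sketch with randomly generated placeholder accuracies; scipy provides the chi-square form of the Friedman statistic, from which the F form reported here is derived).

```python
# Friedman test across methods, F form (Iman-Davenport): F = (b-1)*chi2 / (b*(k-1) - chi2),
# compared against the F(k-1, (b-1)(k-1)) critical value.
import numpy as np
from scipy.stats import friedmanchisquare, f

acc = np.random.rand(10, 3)                          # b = 10 datasets, k = 3 methods (placeholder)
chi2, _ = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
b, k = acc.shape
F_stat = (b - 1) * chi2 / (b * (k - 1) - chi2)
F_crit = f.ppf(0.95, k - 1, (b - 1) * (k - 1))       # e.g. F(0.05, 2, 18) = 3.555
print(F_stat > F_crit)                               # reject the null hypothesis if True
```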

Figure 4: Friedman test of our methods in terms of Hamming accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (MLKT-gMML, MLKT-RR, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

For Example accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 2, 18) = 3.555, meaning that the null hypothesis is also rejected. A post-hoc Nemenyi test is again conducted. Its result is shown in Figure 5 and indicates that: 1) MLKT-gMML-I is significantly different from MLKT-RR; 2) MLKT-gMML again achieves a performance comparable to MLKT-RR.

On the whole, we can conclude that MLKT-gMML-I achieves the best classification performance, while MLKT-gMML achieves a performance comparable to MLKT-RR. Therefore, in the following we concentrate on the comparison between MLKT-gMML-I and the other competitive MDC methods.

5.2. Comparison with several competitive MDC methods

We then compare MLKT-gMML-I with several competitive MDC methods from the literature: Binary Relevance (BR), Classifier Chains (CC), Ensemble of Classifier Chains (ECC), RAkEL and the Super-Class Classifier (SCC). Since the above methods are only designed for modeling output structure, the naive Bayes classifier is used as their base classifier in our experiments. We use an open-source Java framework, namely the MEKA library [27], for the experiments. Regarding the parameterization of these approaches, ECC is

Figure 5: Friedman test of our methods in terms of Example accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (MLKT-RR, MLKT-gMML, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

Table 6: Hamming Accuracy (Part A) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

configured to learn 10 different models for the ensemble; for RAkEL we use the recommended configuration with 2m models over triplets of class combinations; and for SCC we use a nearest-neighbour replacement filter (NNR) to identify all p = 1 infrequent class values and replace them with their n = 2 most frequent nearest neighbours. The Hamming accuracy and Example accuracy of these methods are shown in Tables 6, 7, 8 and 9 respectively.

From the results in these tables, we see that MLKT-gMML-I achieves better performance than the competitive MDC methods (BR, CC, ECC, RAkEL and SCC) on most of the datasets in terms of both evaluation criteria. To verify the performance differences, two non-parametric Friedman tests among these methods are conducted, for Hamming accuracy and for Example accuracy respectively.

Table 7: Hamming Accuracy (Part B) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.

Table 8: Example Accuracy (Part A) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the birds, emotions, medical, scene and yeast datasets.

Table 9: Example Accuracy (Part B) of BR-NB, CC-NB, ECC-NB, RAkEL, SCC and MLKT-gMML-I (mean ± std) on the flags, genbase, CAL500, ImageCLEF2014 and bridges datasets.

For Hamming accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 5, 45) = 2.422. Thus, the null hypothesis that all the methods are identical is rejected, and a post-hoc test needs to be conducted to further examine their differences. To this end, a commonly-used post-hoc test, the Nemenyi test, is conducted. The result is shown in Figure 6, from which we can see that: 1) MLKT-gMML-I is significantly different from two of the methods (BR-NB and CC-NB); 2) there is no significant difference among the remaining methods. Therefore, MLKT-gMML-I achieves slightly better classification performance than the competitive MDC methods.

Figure 6: Friedman test in terms of Hamming accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (BR-NB, CC-NB, ECC-NB, RAkEL, SCC, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

For Example accuracy, the Friedman test renders an F value exceeding the critical value F_(α, k−1, (b−1)(k−1)) = F_(0.05, 5, 45) = 2.422. Thus, the null hypothesis that all the methods are identical is rejected, and the Nemenyi post-hoc test is conducted as above. The result is shown in Figure 7, from which we can see that: 1) MLKT-gMML-I is significantly different from BR-NB; 2) there is no significant difference among the methods other than BR-NB. Therefore, MLKT-gMML-I achieves an Example accuracy comparable to that of the competitive methods.

On the whole, we can conclude that MLKT-gMML-I achieves comparable

Figure 7: Friedman test in terms of Example accuracy. The horizontal axis gives the mean-rank values and the vertical axis lists the methods under test (BR-NB, CC-NB, ECC-NB, RAkEL, SCC, MLKT-gMML-I). For each method, the marker shows its mean rank and the line segment shows the critical range of the Nemenyi test. Two methods differ significantly if their line segments do not overlap, and do not otherwise.

(or even slightly better) classification performance with respect to the competitive MDC methods.

5.3. Comparison with LM-kNN on the MLC task

Note that gMML-I is closely related to MMOC and LM-kNN, and the latter two are specially designed for MLC; we therefore conduct experiments on the MLC datasets to compare their classification performance. However, since MMOC has to deal with an exponentially large number of constraints for each instance in its training procedure, it is infeasible even for the CAL500 dataset with 68 features and 174 labels [19]. Therefore, we only compare gMML-I with LM-kNN. The results are shown in Figure 8 and Figure 9.

We can see from the figures that gMML-I achieves better performance than LM-kNN on six datasets in terms of Hamming accuracy and on four datasets in terms of Example accuracy (both methods achieve zero Example accuracy on CAL500). So, on the whole, gMML-I achieves better classification on most of the datasets. To verify their difference, Friedman tests of the differences between gMML-I and LM-kNN are conducted; the resulting F values, for Hamming accuracy and for Example accuracy respectively,

Figure 8: Hamming Accuracy (HA) of gMML-I and LM-kNN on the birds, emotions, medical, scene, yeast, flags, genbase and CAL500 datasets.

Figure 9: Example Accuracy (EA) of gMML-I and LM-kNN on the birds, emotions, medical, scene, yeast, flags, genbase and CAL500 datasets.

are both not significant (< F_(α, k−1, (b−1)(k−1)) = F_(0.05, 1, 7) = 5.595). So gMML-I obtains classification accuracy on the MLC task competitive with LM-kNN, while having a lower learning complexity than LM-kNN, as analysed in Section 4.4.

6. Conclusions

In this paper, we proposed a new transformation approach for MDC, namely MLKT, which possesses the following favorable characteristics: i) it keeps the output space size of MDC invariant; ii) it reflects the explicit within-dimension relationships; iii) it is easy for subsequent modeling in the transformed space; iv) it overcomes the class overfitting and class imbalance problems suffered by LP-based transformation approaches; v) it is decomposable for each output dimension of MDC. Moreover, we also presented a novel metric learning based method for the transformed problem, which is of independent interest in its own right and has a closed-form solution. Extensive experimental results justified that our approach combining the above two procedures achieves better classification performance than the competitive MDC methods, while our metric learning based method itself also obtains competitive classification performance with a lower learning complexity compared to its counterparts designed specifically for MLC. As mentioned in the introduction, a natural future direction is to adapt further MLC methods into alternatives well suited to our transformed problem.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China and in part by the Funding of Jiangsu Innovation Program for Graduate Education under Grant KYLX. We would also like to express our appreciation for the valuable comments from the reviewers and editors.


More information

Large-Margin Thresholded Ensembles for Ordinal Regression

Large-Margin Thresholded Ensembles for Ordinal Regression Large-Margin Thresholded Ensembles for Ordinal Regression Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology, U.S.A. Conf. on Algorithmic Learning Theory, October 9,

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example

Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Analysis of the Performance of AdaBoost.M2 for the Simulated Digit-Recognition-Example Günther Eibl and Karl Peter Pfeiffer Institute of Biostatistics, Innsbruck, Austria guenther.eibl@uibk.ac.at Abstract.

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

An Ensemble of Bayesian Networks for Multilabel Classification

An Ensemble of Bayesian Networks for Multilabel Classification Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence An Ensemble of Bayesian Networks for Multilabel Classification Antonucci Alessandro, Giorgio Corani, Denis Mauá,

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Midterm Exam, Spring 2005

Midterm Exam, Spring 2005 10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name

More information

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering

More information

CSE 546 Final Exam, Autumn 2013

CSE 546 Final Exam, Autumn 2013 CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,

More information

ECE 592 Topics in Data Science

ECE 592 Topics in Data Science ECE 592 Topics in Data Science Final Fall 2017 December 11, 2017 Please remember to justify your answers carefully, and to staple your test sheet and answers together before submitting. Name: Student ID:

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn

Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn Homework 1 Solutions Probability, Maximum Likelihood Estimation (MLE), Bayes Rule, knn CMU 10-701: Machine Learning (Fall 2016) https://piazza.com/class/is95mzbrvpn63d OUT: September 13th DUE: September

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Introduction to Logistic Regression

Introduction to Logistic Regression Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Robotics 2 AdaBoost for People and Place Detection

Robotics 2 AdaBoost for People and Place Detection Robotics 2 AdaBoost for People and Place Detection Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Wolfram Burgard v.1.0, Kai Arras, Oct 09, including material by Luciano Spinello and Oscar Martinez Mozos

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Giuliano Armano, Francesca Fanni and Alessandro Giuliani Dept. of Electrical and Electronic Engineering, University

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

Linear Classifiers. Michael Collins. January 18, 2012

Linear Classifiers. Michael Collins. January 18, 2012 Linear Classifiers Michael Collins January 18, 2012 Today s Lecture Binary classification problems Linear classifiers The perceptron algorithm Classification Problems: An Example Goal: build a system that

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

A Simple Algorithm for Multilabel Ranking

A Simple Algorithm for Multilabel Ranking A Simple Algorithm for Multilabel Ranking Krzysztof Dembczyński 1 Wojciech Kot lowski 1 Eyke Hüllermeier 2 1 Intelligent Decision Support Systems Laboratory (IDSS), Poznań University of Technology, Poland

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels

Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels Adaptive Kernel Principal Component Analysis With Unsupervised Learning of Kernels Daoqiang Zhang Zhi-Hua Zhou National Laboratory for Novel Software Technology Nanjing University, Nanjing 2193, China

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines Kernel Methods & Support Vector Machines Mahdi pakdaman Naeini PhD Candidate, University of Tehran Senior Researcher, TOSAN Intelligent Data Miners Outline Motivation Introduction to pattern recognition

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

arxiv: v1 [stat.ml] 10 Dec 2015

arxiv: v1 [stat.ml] 10 Dec 2015 Boosted Sparse Non-linear Distance Metric Learning arxiv:1512.03396v1 [stat.ml] 10 Dec 2015 Yuting Ma Tian Zheng yma@stat.columbia.edu tzheng@stat.columbia.edu Department of Statistics Department of Statistics

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Metric Learning. 16 th Feb 2017 Rahul Dey Anurag Chowdhury

Metric Learning. 16 th Feb 2017 Rahul Dey Anurag Chowdhury Metric Learning 16 th Feb 2017 Rahul Dey Anurag Chowdhury 1 Presentation based on Bellet, Aurélien, Amaury Habrard, and Marc Sebban. "A survey on metric learning for feature vectors and structured data."

More information

Machine Learning, Fall 2011: Homework 5

Machine Learning, Fall 2011: Homework 5 0-60 Machine Learning, Fall 0: Homework 5 Machine Learning Department Carnegie Mellon University Due:??? Instructions There are 3 questions on this assignment. Please submit your completed homework to

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Undirected graphical models

Undirected graphical models Undirected graphical models Semantics of probabilistic models over undirected graphs Parameters of undirected models Example applications COMP-652 and ECSE-608, February 16, 2017 1 Undirected graphical

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

Kernels for Multi task Learning

Kernels for Multi task Learning Kernels for Multi task Learning Charles A Micchelli Department of Mathematics and Statistics State University of New York, The University at Albany 1400 Washington Avenue, Albany, NY, 12222, USA Massimiliano

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT) Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy

More information

SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES

SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES SINGLE-TASK AND MULTITASK SPARSE GAUSSIAN PROCESSES JIANG ZHU, SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, P. R. China E-MAIL:

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information