Transfer Ordinal Label Learning

Chun-Wei Seah, Ivor W. Tsang, Yew-Soon Ong

Abstract: Designing a classifier in the absence of labeled data is becoming a common encounter, as the acquisition of informative labels is often difficult or expensive, particularly on new, uncharted target domains. The feasibility of attaining a reliable classifier for the task of interest has been explored in transfer learning, where label information from relevant source domains is considered for complementing the design process. The core challenge arising from such endeavors, however, is the induction of source sample selection bias, such that the trained classifier has the tendency of steering towards the distribution of the source domain. This bias is deemed to become more severe on data involving multiple classes. Taking this cue, our interest in this paper is to address such a challenge in the target domain, where ordinal labeled data are unavailable. In contrast to previous works, we propose a Transfer Ordinal Label Learning (TOLL) paradigm to predict the ordinal labels of target unlabeled data by spanning the feasible solution space with an ensemble of ordinal classifiers from multiple relevant source domains. Specifically, the maximum margin criterion is considered here for the construction of the target classifier from an ensemble of source ordinal classifiers. Theoretical analysis and extensive empirical studies on real-world datasets are presented to study the benefits of the proposed method.

Index Terms: Transfer Learning, Domain Adaptation, Ordinal Regression, Source Sample Selection Bias, Classifier Selection

I. INTRODUCTION

To date, many practical realizations of machine intelligence are making their way as important tools that assist humans in their decision-making process. A motivating example is sentiment rating prediction on user reviews as a tool for crafting novel marketing strategies for newly launched products (referred to as the target domain). Each user review can be categorized into different star-ratings (often represented as ordinal labels in machine classification), where a higher star-rating indicates better feedback on the product. In practice, most newly launched products have many user comments posted on the Internet. However, it is usually the case that few of such comments are readily tagged with sentiment star-rating labels. To address the absence of such label information, the field of Domain Adaptation (DA) learning has embarked on feasibility studies of classifiers for new target domains using the available label information of other related source domains. The initial work on DA, as proposed in [1], presented a study involving the use of a single related source domain that shares a common joint distribution with the target domain of interest.

(Manuscript received December 3, 2012; revised March 2, 2013 and June 6, 2013; accepted June 10, 2013. This research is partially supported by the Multi-plAtform Game Innovation Centre (MAGIC) at Nanyang Technological University. MAGIC is funded by the Interactive Digital Media Programme Office (IDMPO) hosted by the Media Development Authority of Singapore. The authors are with the School of Computer Engineering, Nanyang Technological University, Singapore; SChunWei@dso.edu.sg, {IvorTsang,asYSOng}@ntu.edu.sg.)

Subsequent works have moved on to relax the strict common joint distribution assumption.
In particular, dissimilarity in the marginal distributions among domains has been established as covariate shift [2-5]. To date, a common remedy to the covariate shift issue is instance re-weighting [6-10], where the weight of each source sample is defined according to the density ratio of the target marginal distribution P_t(x) to the source marginal distribution P_s(x), i.e., P_t(x)/P_s(x). In this manner, the dissimilarities between the source and target domains are modeled with different marginal distributions P(x), while the similarity of the predictive distributions P(y|x) across the different domains is preserved. The Kernel Mean Matching (KMM) method [7], for instance, first estimates the weight of each source sample by minimizing the Maximum Mean Discrepancy (MMD) [11] between the source labeled samples and the target unlabeled samples. The re-weighted source samples are subsequently used for training the target classifier. More recently, a unified framework of density-ratio estimation based on the Bregman divergence has been proposed [12], which includes KMM as one of its variants. Another popular scheme in the DA field is to seek an appropriate feature representation of the source domain that corresponds well to the feature space of the target domain [13-19]. For instance, by minimizing the MMD between the source and target samples, Transfer Component Analysis (TCA) [18] identifies a suitable latent space spanned by some basis vectors, referred to as the transfer components.

In recent years, many advancing DA methods have broadened their scope to consider leveraging the label information from multiple relevant source domains. For instance, by treating every source domain equally, Multiple Convex Combinations (MCC) formulates the target classifier as a fusion of multiple Support Vector Machine (SVM) classifiers that are learned from the individual relevant source domains [20, 21]. However, it is worth noting that a simple and direct compilation of all data in the source domains to complement the target learning task can lead to adverse outcomes [22], especially when the classifier learned from the source data fails to be discriminative on the target data. As such, an extension of MCC, labeled the Domain Adaptation Machine (DAM) in [23], was subsequently proposed, where prior knowledge on the source and target domains is incorporated to define the importance of each source classifier. More importantly, it is worth highlighting that, in general, all DA methods train the target classifier by minimizing the empirical risk defined only on the source data or its weighted samples. With such a design process, the classifier is likely to exhibit properties that are steered towards the distribution of the source domain, and this inevitably induces biases in the resultant prediction, thus potentially leading to poor accuracy in the prediction of the unseen target data.
Particularly, we consider the study of transfer ordinal label learning, since the bias is expected to be more severe when multiple classes are involved [24, 25]. We refer to this phenomenon as the source sample selection bias. (Note that sample selection bias is well known in econometrics [26, 27] and in dataset shift [28], and covariate shift is considered one of its variants [9].) To alleviate the source sample selection bias, it is generally advisable to directly minimize the expected risk functional defined only on the target data, for example, by leveraging any prior knowledge that may be available on the output label structure of the target domain. An intuitive solution is to group the target unlabeled samples via an unsupervised learning paradigm, subject to some imposed criteria, such as Maximum Margin Clustering (MMC) and Maximum Margin Context described in [29] and [30], respectively. Particularly, MMC maximizes the margin between opposite clusters by considering all possible combinations of labels on the target unlabeled samples. To be specific, MMC optimizes the labels of u unlabeled samples from c^u unique label combinations for a c-class problem. MMC, however, has its limits. By not taking into account class structure, such as the abundance of label information readily available in the related source domains (for instance, ordinal class labels in the context of ordinal regression), it tends to underperform DA methods in general. Besides, the approach may sometimes lead to trivial solutions, such as the case where all samples are grouped under the same class label and hence deemed futile [29, 31].

In this paper, our interest lies in addressing the challenges pertaining to source sample selection bias in the absence of target labeled data. In contrast to existing DA works and MCC, we propose a novel Transfer Ordinal Label Learning approach, or TOLL in short, which imposes the maximum margin criterion on the target unlabeled data in the process of constructing the target classifier from an ensemble of source ordinal classifiers. Here, this paper assumes the source and target domains share the same task (in the event where the source and target domains originate from different tasks, the reader is referred to [32-34]). In the absence of target labeled data, it is reasonable to assume that the feasible solution space of the target ordinal labels can be spanned by a series of source ordinal classifiers. The core contributions of the present paper are summarized as follows:

1) Existing DA methods that seek instance re-weighting or an appropriate feature representation have to date only taken the marginal distribution differences between source and target domains into consideration. Furthermore, it has been established that the effects of source sample selection bias become more severe and challenging in the context of ordinal problems. Despite the advancements of DA approaches, to date none has considered making use of ordinal information in their framework as a means to improve ordinal predictions, mainly because the transfer of output structures from source to target domains is a non-trivial task. To the best of our knowledge, this paper thus presents the first DA work that embarks on an investigation of the issues pertaining to source sample selection bias under the challenging context of ordinal regression. Particularly, TOLL learns the ordinal labels of the target unlabeled data from a convex hull of the ordinal outputs that are predicted by multiple source classifiers, namely the label vectors.

2) We present the generalization absolute error bound for ordinal regression in the target domain. Our analysis shows that, when the target unlabeled data follows the cluster assumption [35, 36] well, a classifier with a large target margin can reduce this error bound. In the experimental study on the sentiment classification application, the results show that an ensemble of source ordinal classifiers with a larger target margin is associated with a smaller testing absolute error in the target domain. This verifies the appropriateness and effectiveness of choosing discriminative source classifiers for ordinal regression in the DA setting.

3) Furthermore, our extensive experimental studies highlight that TOLL emerged as superior to several state-of-the-art DA methods in most of the tasks considered, and is robust to various settings of differing class distribution ratios between the source and target domains.

The rest of this paper is organized as follows: Section II gives the preliminaries and a brief review of ordinal regression. Section III introduces the formulation of TOLL and implementation details. Extensive experiments on the Sentiment, Newsgroup and Email datasets are then carried out in Section IV. The experimental results are analyzed and discussed in Section V. Lastly, the concluding remarks of this paper are drawn in Section VI. A preliminary work on TOLL can be found in [37]; this paper serves as a significant extension, which includes, but is not limited to, the extension to ordinal regression, the derivation of the generalization absolute error bound, and the experimental study on ordinal regression problems.

II. PRELIMINARIES AND REVIEW OF ORDINAL REGRESSION

In this section, the notation used in the present manuscript and a brief review of the extended binary classification model for ordinal regression are presented.

A. Notations

Throughout the rest of this paper, a superscript ⊤ denotes the transpose of a vector or a matrix, ∘ denotes the element-wise product operator, I[·] denotes an indicator function that returns 1 if the predicate holds and 0 otherwise, and sign(·) is a function that returns −1 if the input is negative and +1 otherwise. Moreover, 1 denotes a vector of all ones. Given m source domains and one target domain X_u, which contains u unlabeled (testing) samples x_j ∈ R^p, the task in Domain Adaptation (DA) is to leverage the available labeled data in relevant source domains to predict the class label ŷ_j ∈ {1, 2, ..., K} of each unlabeled sample in the target domain, involving a K ordinal class problem. In addition, a K ordinal class problem is represented by K − 1 ordered thresholds θ_1 ≤ θ_2 ≤ ... ≤ θ_{K−1}, where θ_0 = −∞ and θ_K = ∞. A predictive output h(x) of a sample x that falls between θ_{k−1} and θ_k is thus classified as class k.
Fig. 1. The Transfer Ordinal Label Learning framework. Precomputed classifiers from source domains 1 to m, together with the unlabeled data in the target domain, are used in Step 1 to generate the target label space (see Algorithm 1) and in Step 2 to learn the ordinal labels of the target unlabeled data (see Algorithm 2).

B. Extended Binary Classification Model for Ordinal Regression

In this subsection, we briefly outline an extended binary classification model that has showcased state-of-the-art performance for ordinal regression [38, 39]. (Note that a very similar idea was previously presented in [40].) An ordinal labeled sample (x, y) can be extended to K − 1 binary samples in the SVM algorithm via the following transformation:

x^k = (x, e_k) ∈ R^{p+K−1}, y^k = 2 I[y > k] − 1, for k = 1, 2, ..., K − 1,   (1)

where e_k ∈ R^{K−1} denotes a vector with the kth element being one, while the rest of the elements are zero. As an extended binary sample has a dimension of (p + K − 1), the weight vector w of the SVM is also augmented, to (w, −θ), which is used to give the binary predictive value of x^k as:

f(x^k) = sign((w, −θ)^⊤ x^k) = sign(h(x) − θ_k),   (2)

where h(x) = w^⊤ x. Using (2), the predictive class label of a sample x is then given as follows:

1 + Σ_{k=1}^{K−1} I[f(x^k) = 1].   (3)

III. THE PROPOSED TRANSFER ORDINAL LABEL LEARNING

Figure 1 depicts the learning process of the proposed TOLL framework. Without loss of generality, source classifiers are first trained for each unique combination of source domains. The source classifiers can be trained using any DA method that is readily available. Note that the source classifiers can even be precomputed so as to preserve the interests of a company, such as the privacy and security of customer data. In TOLL, the relevancy and specificity of each source classifier is then learned with respect to the target domain. In particular, TOLL alleviates the presence of any unwanted sample selection bias that may exist by learning the biases of each source classifier, based on prior knowledge available on the output label structure of the target domain. All these source classifiers with different biases are subsequently used to span the target label space (see Sec. III-A). Once the target label space is formed, TOLL proceeds to simultaneously learn the weight of each source classifier and the target classifier for the domain of interest, in a manner where the margin of separation in the target label space is maximized (see Sec. III-B).

A. Generating the Target Label Space from Multiple Sources

Using the complementary labeled data from multiple relevant source domains, an appropriate target classifier can be derived from an ensemble of source classifiers for the purpose of target unlabeled data prediction. In what follows, the procedure to generate the label space for a given set of target unlabeled data, referred to as the target label space, is discussed. An outline of the procedure is summarized in Algorithm 1 and sketched in code below. Given the availability of m source domains, the design process begins with the construction of a classifier in each source domain and also a classifier for each combination of 2, 3, ..., (m − 1) source domains, until all S possible combinations of the m source domains have been explored, i.e., S = Σ_{i=1}^{m} m!/(i!(m − i)!) = 2^m − 1 classifiers are trained. Note that diverse forms of source classifiers can be trained, based on the SVM, Gaussian Processes [41], the Transductive SVM [35] or any other variant of supervised, semi-supervised or DA methods. Without loss of generality, we consider the supervised SVM in the present manuscript. Like most models, each of the S source classifiers includes a bias term b such that the decision boundary is not restricted to intersect the origin. TOLL leverages the biases of the source classifiers to generate label vectors y = [y_1^1, ..., y_1^{K−1}, ..., y_u^1, ..., y_u^{K−1}]^⊤ for the target unlabeled data, where [y_i^1, ..., y_i^{K−1}] ∈ {−1, 1}^{K−1} denotes the extended class labels of the ith sample. Since the source classifiers may be trained from source domains whose distributions differ from the target domain, it is more beneficial to determine the bias b based on the target data. Hence, we propose to define the bias b of the source classifiers in such a way that the label vector y of the target unlabeled data satisfies the following balance constraint:

(u/K)(1 − β) ≤ Σ_{i=1}^{u} I[(Σ_j I[y_i^j = 1]) + 1 = k] ≤ (u/K)(1 + β), ∀k ∈ {1, ..., K},

where β is the hyper-parameter that restricts the imbalance of the class size q_k for the kth class implied by the label vector. This constraint can be implicitly imposed by sorting the classifier's decision outputs on the target unlabeled data, and forms at most Z (on the order of 2βu/K) unique label vectors. Hence, the target label space is spanned by S × Z label vectors. With these S × Z label vectors, the target label space, M, is then defined as follows:

M = {ŷ = Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s y_z^s | Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s = 1; g_z^s ≥ 0;
  (u/K)(1 − β) ≤ Σ_{i=1}^{u} I[(Σ_j I[y_{zi}^{sj} = 1]) + 1 = k] ≤ (u/K)(1 + β),
  k = 1, ..., K, z = 1, ..., Z, s = 1, ..., S},   (4)

where the importance of each source label vector y_z^s is weighted by g_z^s and, without loss of generality, the extended binary class labels of X_u are denoted by y = [y_1^1, ..., y_1^{K−1}, ..., y_u^1, ..., y_u^{K−1}]^⊤. In addition, M forms the convex hull of the target output label space [42].
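As a concrete illustration of Algorithm 1, the following is a minimal sketch of the label-vector generation step for a single source classifier, assuming its decision values on the target unlabeled data have already been computed. The function and variable names are illustrative only and do not come from the authors' implementation; the enumeration of class-size tuples and the sorting direction are kept naive for clarity.

```python
import itertools
import numpy as np

def generate_label_vectors(decision_values, K, beta):
    """Sketch of Algorithm 1 for one source classifier.

    decision_values : (u,) array of f_s(x_i) on the target unlabeled data.
    K               : number of ordinal classes.
    beta            : slack on the balanced class sizes.
    Returns a list of (u, K-1) arrays with entries in {-1, +1}
    (the extended label vectors y_z^s).
    """
    u = len(decision_values)
    order = np.argsort(decision_values)        # smallest scores -> lowest ordinal class
    lo = int(np.floor(u / K * (1 - beta)))
    hi = int(np.ceil(u / K * (1 + beta)))

    vectors = []
    # enumerate class-size tuples (q_1, ..., q_K) satisfying the balance constraint
    for sizes in itertools.product(range(lo, hi + 1), repeat=K - 1):
        q_last = u - sum(sizes)
        if not (lo <= q_last <= hi):
            continue
        q = list(sizes) + [q_last]

        labels = np.empty(u, dtype=int)
        start = 0
        for c, qc in enumerate(q, start=1):    # assign classes 1..K by sorted rank
            labels[order[start:start + qc]] = c
            start += qc

        # extended binary encoding of (1): y_i^k = +1 if y_i > k else -1, k = 1..K-1
        ext = np.array([[1 if y > k else -1 for k in range(1, K)] for y in labels])
        vectors.append(ext)
    return vectors
```

For m source domains, the same routine would be applied to each of the S = 2^m − 1 precomputed classifiers, and the resulting label vectors pooled to span the target label space M.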

Algorithm 1 Generation of the target label space
1: Inputs: F (a set of (precomputed) source classifiers trained from each unique combination of source domains); β, which controls the imbalance of the label vectors
2: Output: Y (a set of generated label vectors for the target unlabeled data)
3: for all f_s ∈ F do
4:   indexes = sort(f_s(x_1), ..., f_s(x_u))
5:   z = 1; q_0 = 0
6:   for each unique set {q_1, ..., q_K} with (u/K)(1 − β) ≤ q_k ≤ (u/K)(1 + β) and Σ_{k=1}^{K} q_k = u, k = 1, ..., K do
7:     create a vector y_z^s ∈ R^{u(K−1)}
8:     for C = 1, ..., K do
9:       assign the entries of y_z^s from the (Σ_{k=0}^{C−1} q_k + 1)-th index to the (Σ_{k=1}^{C} q_k)-th index as the extended class label C
10:     end for
11:     Y = Y ∪ {y_z^s}; z = z + 1
12:   end for
13: end for
14: return Y

B. Proposed Formulation

To alleviate the source sample selection bias, we propose the minimization of the expected risk by taking only the target unlabeled samples into consideration. Particularly, in TOLL, learning the labels of the unlabeled samples is conducted by minimizing the following structural risk using the hinge loss function of the SVM:

min_{ŷ∈M} { min_{w,θ,ρ,ξ} (1/2)(‖w‖² + ‖θ‖²) − ρ + C Σ_{k=1}^{K−1} Σ_{i=1}^{u} ξ_i^k
  s.t. ŷ_i^k (w^⊤ϕ(x_i) − θ_k) ≥ ρ − ξ_i^k, ξ_i^k ≥ 0, i = 1, ..., u, k = 1, ..., K − 1,
       θ_k ≤ θ_{k+1}, k = 1, ..., K − 2 },   (5)

where ϕ(x) maps x into a high-dimensional space, ŷ_i^k ∈ {+1, −1}, w^⊤ϕ(x) is the predictive function, ρ is the maximum error allowable before the slack variable ξ_i^k is penalized, and C denotes the regularization parameter that trades off between model complexity and empirical risk. Since the hinge loss employed in the inner minimization (i.e., enclosed by {·} in (5)) is non-increasing, the ordered constraints θ_1 ≤ θ_2 ≤ ... ≤ θ_{K−1} are implicitly fulfilled (see the proof of Theorem 2 in [38]). With the outer minimization of (5) over ŷ, the optimal decision function w^⊤ϕ(x) is essentially the solution whose decision boundaries lie in the low-density regions of the target unlabeled data [36]. Furthermore, TOLL learns the weight of each label vector y_z^s (as predicted by a source classifier) in (5) by minimizing the structural risk involving the target samples only. In this manner, the kernel expansion of the target classifier will only be defined by data samples in the target domain. Note that in the event that some target labeled data do exist, such information can easily be incorporated into TOLL by simply imposing the labels of the target labeled data on the available y_z^s.

C. Optimization in TOLL

In what follows, the detailed steps to solve (5) in TOLL are presented. First, the Lagrangian of the inner minimization in (5), enclosed by {·}, can be written as follows:

L = (1/2)(‖w‖² + ‖θ‖²) − ρ + C Σ_{i=1}^{u} Σ_{k=1}^{K−1} ξ_i^k
    − Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k (ŷ_i^k (w^⊤ϕ(x_i) − θ_k) − ρ + ξ_i^k) − Σ_{i=1}^{u} Σ_{k=1}^{K−1} λ_i^k ξ_i^k,   (6)

where α_i^k ≥ 0 and λ_i^k ≥ 0 are the Lagrangian multipliers of the inequality constraints. According to the KKT conditions, we have:

w = Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k ŷ_i^k ϕ(x_i),   (7)
θ_k = −Σ_{i=1}^{u} α_i^k ŷ_i^k,   (8)
C = α_i^k + λ_i^k,   (9)
Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k = 1.   (10)

Substituting (7), (8), (9) and (10) back into (6), we have

max_α −(1/2) Σ_{i,j=1}^{u} Σ_{k,k'=1}^{K−1} α_i^k α_j^{k'} ŷ_i^k ŷ_j^{k'} K(x_i^k, x_j^{k'}),

where K(x_i^k, x_j^{k'}) = ϕ(x_i)^⊤ϕ(x_j) + I[k = k'].
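For illustration, the extended kernel above has a simple block structure: it is the base kernel ϕ(x_i)^⊤ϕ(x_j) repeated over the K − 1 thresholds, plus 1 whenever the two extended samples share the same threshold index. The sketch below is a direct transcription of that formula (not the authors' code), with illustrative names only.

```python
import numpy as np

def extended_kernel(K_base, num_classes):
    """Kernel on the extended samples x_i^k = (x_i, e_k):
    K((x_i, e_k), (x_j, e_k')) = K_base[i, j] + I[k == k'].

    K_base      : (u, u) base kernel matrix on the target samples.
    num_classes : K; the extension uses thresholds k = 1, ..., K-1.
    Returns a ((K-1)*u, (K-1)*u) matrix ordered sample-major,
    i.e. row index = i*(K-1) + (k-1).
    """
    u = K_base.shape[0]
    km1 = num_classes - 1
    # kron(K_base, ones) repeats the base kernel over all threshold pairs;
    # kron(ones, eye) adds the I[k == k'] term for matching thresholds.
    return np.kron(K_base, np.ones((km1, km1))) + np.kron(np.ones((u, u)), np.eye(km1))
```

This ((K − 1)u) × ((K − 1)u) matrix plays the role of K in the matrix form of the dual that follows.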
We further define α = [α_1^1, ..., α_1^{K−1}, ..., α_u^1, ..., α_u^{K−1}]^⊤ and A = {α | Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k = 1, 0 ≤ α_i^k ≤ C, i = 1, ..., u, k = 1, ..., K − 1}; then (5) is simplified as follows:

min_{ŷ∈M} max_{α∈A} −(1/2) α^⊤ (K ∘ ŷŷ^⊤) α.   (11)

Since A and M are both compact sets, according to the minimax theorem [43], swapping the order of the min and max in (11) is equivalent to:

max_{α∈A} min_{ŷ∈M} −(1/2) α^⊤ (K ∘ ŷŷ^⊤) α.   (12)

In addition, (12) can be reformulated as:

max_{α∈A} { max_Ψ −Ψ  s.t.  Ψ ≥ (1/2) α^⊤ (K ∘ y_t y_t^⊤) α, ∀ y_t ∈ M }.   (13)

Moreover, the dual form of the inner maximization of (13) is:

max_{α∈A} min_{d∈D} −(1/2) α^⊤ ( Σ_{t: y_t∈M} d_t K ∘ y_t y_t^⊤ ) α,   (14)

where d denotes a vector of Lagrangian multipliers d_t and D = {d | Σ_{t: y_t∈M} d_t = 1, d_t ≥ 0 ∀t : y_t ∈ M} is the domain of d.
Algorithm 2 Transfer Ordinal Label Learning (TOLL)
1: Inputs: M_2 (the set of source label vectors generated by Algorithm 1)
2: Initialize α uniformly over its (K − 1)u entries, find the most violated y_t via (16) and let S = {y_t}
3: repeat
4:   Find the optimal d_S and α in (15) via MKL
5:   Find the most violated y_t via (16) and set S = S ∪ {y_t}
6: until convergence
7: return d_t, y_t ∀t : y_t ∈ S

Since D and A are both compact sets, swapping the order of the max and min in (14) is equivalent to:

min_{d∈D} max_{α∈A} −(1/2) α^⊤ ( Σ_{t: y_t∈M} d_t K ∘ y_t y_t^⊤ ) α.   (15)

Note that the set M in (15) corresponds to the base kernels of a Multiple Kernel Learning (MKL) problem [44]. Hence, (15) can be solved using efficient MKL solvers [45]. In the presence of a significant number of source classifiers, solving (15) directly by MKL may not be efficient. Fortunately, as it is unlikely for all of the constraints in (13) to be active simultaneously at the optimal solution, the cutting plane method can be efficiently deployed [46] to solve (15) (see Algorithm 2). The algorithm begins with a uniform initialization of α and then locates the most violated constraint of (13) via (16).

Theorem 1. The most violated constraint of (13) for a fixed α is given by:

arg max_{y∈M_2} (1/2) α^⊤ (K ∘ yy^⊤) α,   (16)

where M_2 = {y_1^1, ..., y_Z^1, ..., y_1^S, ..., y_Z^S}.

Proof: Let f(y) = (1/2) α^⊤ (K ∘ yy^⊤) α. Since f(·) is a convex function, f((1 − λ)y_i + λy_j) ≤ (1 − λ)f(y_i) + λf(y_j), ∀y_i, y_j ∈ M_2, λ ∈ [0, 1], according to the convexity property. If f(y_i) > f(y_j), then f((1 − λ)y_i + λy_j) ≤ f(y_i). Similarly, if f(y_i) < f(y_j), then f((1 − λ)y_i + λy_j) ≤ f(y_j). Therefore, f((1 − λ)y_i + λy_j) ≤ max(f(y_i), f(y_j)) holds. By induction [42], f(λ_1^1 y_1^1 + ... + λ_Z^1 y_Z^1 + ... + λ_1^S y_1^S + ... + λ_Z^S y_Z^S) ≤ max_{y∈M_2} f(y), given Σ_{s=1}^{S} Σ_{z=1}^{Z} λ_z^s = 1 and λ_z^s ∈ [0, 1].

Note that to solve (16), no numerical optimization solver is needed, since the maximum objective value is simply obtained by computing all the objective values over the set M_2; the most violated y_t corresponds to the one with the highest value among those computed. Hence, the first active constraint is chosen based on the most violated y_t. Thereafter, the current set of selected constraints is solved via MKL before obtaining the next most violated constraint for inclusion into the set of constraints. The process of finding the next most violated constraint is repeated until convergence. Empirically, only a few iterations are needed for Algorithm 2 to converge. The overall time complexity of TOLL is O(TJ((K − 1)u)^{2.3}), where J and T are the numbers of iterations incurred by the cutting plane method and MKL, respectively, and O(((K − 1)u)^{2.3}) denotes the empirical complexity of SVM training. From our experience in running the experiments, J is generally less than a dozen and T is usually small as it depends on J.

Upon convergence, the labels of X_u can be derived as follows. For a K-class problem with K > 2, by replacing f(x^k) in (3) with sign(Σ_{t: y_t∈S} d_t y_t^k), the class label of x becomes (Σ_{k=1}^{K−1} I[sign(Σ_{t: y_t∈S} d_t y_t^k) = 1]) + 1. This type of labeling is based on weighted voting, in which each vote carries a learned weight d_t. In addition, for a binary problem (i.e., K = 2), the labels of the target domain can be recovered using singular value decomposition on Y = Σ_{t: y_t∈S} d_t y_t y_t^⊤ as D_1 V_1 [29, 31], where D_1 and V_1 are the largest eigenvalue and the corresponding eigenvector, respectively. Then, the polarity of the groups learned by V_1 can be determined by a majority vote among the source classifiers.
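The two computational ingredients just described, selecting the most violated label vector in (16) by direct enumeration over M_2 and recovering class labels by weighted voting once the weights d_t are learned, can be sketched as follows. This is only an illustration under the notation above, with hypothetical function names.

```python
import numpy as np

def most_violated(alpha, K_ext, candidates):
    """Pick the y in M_2 maximizing (1/2) * alpha^T (K_ext o y y^T) alpha, as in (16).

    alpha      : ((K-1)*u,) vector of dual variables.
    K_ext      : ((K-1)*u, (K-1)*u) extended kernel matrix.
    candidates : list of ((K-1)*u,) extended label vectors (the set M_2).
    """
    # alpha^T (K o y y^T) alpha == (alpha * y)^T K (alpha * y)
    scores = [0.5 * (alpha * y) @ K_ext @ (alpha * y) for y in candidates]
    return candidates[int(np.argmax(scores))]

def predict_labels(d, selected):
    """Weighted-voting rule for K > 2:
    class(x_i) = 1 + sum_k I[ sign( sum_t d_t * y_t[i, k] ) = +1 ].

    d        : (T,) learned weights of the selected label vectors.
    selected : (T, u, K-1) array of the extended label vectors in S.
    """
    votes = np.tensordot(d, selected, axes=1)      # (u, K-1): weighted vote per threshold
    return 1 + np.sum(np.sign(votes) > 0, axis=1)  # count thresholds voted +1
```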
D. Generalization Error Bound of TOLL

In this subsection, we analyze the generalization absolute error bound of the proposed TOLL in the target domain. First, we define the joint distributions of the sth source domain and the target domain as P^s and P^t, respectively. Similarly, the marginal distributions of the sth source domain and the target domain are denoted by D^s and D^t, respectively. The expected errors of the sth source domain and the target domain are then given by

ε^s(h) = E_{(x,y)∼P^s} I[sign(h(x)) ≠ y]  and  ε^t(h) = E_{(x,y)∼P^t} I[sign(h(x)) ≠ y],

respectively. Note that I[sign(h(x)) ≠ y] is the zero-one loss function. Similarly, the expected errors for the kth extended class of the sth source domain and the target domain are given by

ε_k^s(h) = E_{(x,y)∼P^s} I[sign(h(x^k)) ≠ y^k]  and  ε_k^t(h) = E_{(x,y)∼P^t} I[sign(h(x^k)) ≠ y^k],

respectively. In addition, given two hypotheses h_1 and h_2, we define ε^t(h_1, h_2) = E_{x∼D^t} I[sign(h_1(x)) ≠ sign(h_2(x))]. In what follows, we first derive the generalization absolute error bound for a target hypothesis of ordinal regression in Theorem 2 and Theorem 3. After that, the generalization absolute error bound on the target data for TOLL will be derived in Theorem 4.

Theorem 2. A hypothesis h of ordinal regression has the following generalization absolute error bound in the target domain:

Σ_{k=1}^{K−1} ε_k^t(h) ≤ Σ_{k=1}^{K−1} (ε_k^s(h) + d_k^s(h) + λ_k^s),   (17)

where λ_k^s = min_{h'∈H} ε_k^s(h') + ε_k^t(h') and d_k^s(h) = |ε_k^t(h, h') − ε_k^s(h, h')|, and |·| denotes the absolute value operator.

Proof: From [47], a hypothesis h has the following generalization error bound in the target domain:

ε^t(h) ≤ ε^s(h) + d^s(h) + λ^s,   (18)

where λ^s = min_{h'∈H} ε^s(h') + ε^t(h') and d^s(h) = |ε^t(h, h') − ε^s(h, h')|. Using the extended binary classification model, the generalization error bound of the hypothesis h on the kth extended class is:

ε_k^t(h) ≤ ε_k^s(h) + d_k^s(h) + λ_k^s.   (19)

Combining the error bounds over all ordinal labels, the proof is completed.

Theorem 3. For a margin Λ > 0, with probability at least 1 − δ, a hypothesis h of ordinal regression has the following generalization absolute error bound in the target domain:

Σ_{k=1}^{K−1} ε_k^t(h) ≤ Σ_{k=1}^{K−1} (ε̂_k^s(h) + λ_k^s + d_k^s(h)),   (20)

where the empirical term ε̂_k^s(h) = (1/n_s) Σ_{i=1}^{n_s} I[y_i^{sk} h(x_i^{sk}) ≤ Λ] + Γ^s, and the confidence term Γ^s = O(·) vanishes with the source sample size n_s and depends on R/Λ, log n_s and δ, with K(x, x) + 1 ≤ R², ‖w‖² + ‖θ‖² ≤ 1, and h(x^k) = w^⊤x − θ_k.

Proof: From Theorem 6 of [48], a hypothesis h of ordinal regression has the following source generalization absolute error bound:

Σ_{k=1}^{K−1} ε_k^s(h) ≤ Σ_{k=1}^{K−1} ε̂_k^s(h).   (21)

Next, by substituting (21) into (17), the proof is obtained.

Theorem 4. A hypothesis h of ordinal regression in the proposed framework, TOLL, has the following generalization absolute error bound in the target domain:

Σ_{k=1}^{K−1} ε_k^t(h) ≤ Σ_{s=1}^{S} Σ_{z=1}^{Z} Σ_{k=1}^{K−1} g_z^s (ε̂_k^s(h) + λ_k^s + d_k^s(h)),   (22)

with Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s = 1.

Proof: Since TOLL imposes the inequality constraint g_z^s ≥ 0, the following holds for any g_z^s applied to (20):

g_z^s Σ_{k=1}^{K−1} ε_k^t(h) ≤ g_z^s Σ_{k=1}^{K−1} (ε̂_k^s(h) + λ_k^s + d_k^s(h)).   (23)

Then, summing (23) over all s and z with Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s = 1, the proof is completed.

Using the generalization bound derived in (22), we proceed to discuss the solution obtained by the strategy in TOLL. With Algorithm 1, TOLL trains a classifier that minimizes the structural risk for each source domain, and then attains numerous hypotheses from the multiple relevant source classifiers by projecting their bias parameters onto the target unlabeled data. Next, the weight g_z^s is obtained for each hypothesis via Algorithm 2. As the hypotheses are obtained from the source domains, it is reasonable for ε̂_k^s(h) to be small. (If the empirical risk of a source domain is high, that source domain can be removed from being considered to form the hypotheses of TOLL; for simplicity, we assume the empirical risks of all source domains are acceptable, so no removal is needed.) Furthermore, although λ_k^s is unknown, to be consistent with previously reported DA works we shall assume λ_k^s to be small. Since both ε̂_k^s(h) and λ_k^s in (22) are small, the remaining term to minimize reduces to Σ_{s=1}^{S} Σ_{z=1}^{Z} Σ_{k=1}^{K−1} g_z^s d_k^s(h), where d_k^s(h) = |ε_k^t(h, h') − ε_k^s(h, h')|. In what follows, we present the details of optimizing this term. In particular, there are two cases to analyze for d_k^s(h), namely, ε_k^t(h, h') ≤ ε_k^s(h, h') and ε_k^t(h, h') ≥ ε_k^s(h, h').

Remark 1. When ε_k^t(h, h') ≤ ε_k^s(h, h'), we have d_k^s(h) = ε_k^s(h, h') − ε_k^t(h, h'). Note that since ε_k^s(h, h') ≤ ε̂_k^s(h) + ε_k^s(h') (by the triangle inequality), in which ε_k^s(h') is part of λ_k^s that is assumed to be reasonably small, and ε̂_k^s(h) (defined in (20)) can be estimated and chosen to be small, the bound for d_k^s(h) should also be reasonably small. Recall that minimizing (5) over ŷ ∈ M is equivalent to choosing a label vector ŷ that enforces a decision boundary lying in the lower-density regions of the target unlabeled data.
It is thus expected that ε_k^t(h, h') will be small, according to the cluster assumption [35, 36].

Remark 2. When ε_k^t(h, h') ≥ ε_k^s(h, h'), we have d_k^s(h) = ε_k^t(h, h') − ε_k^s(h, h'). Hence, minimizing ε_k^t(h, h') leads to the minimization of d_k^s(h) as well.

In summary, the ensemble strategy proposed in TOLL alleviates the risk of choosing a poor source hypothesis.

IV. EXPERIMENTAL STUDY

In this section, Subsections IV-A, IV-B, IV-C and IV-D describe, respectively, the settings of the class ratios of the source and target domains, the datasets (Sentiment, Newsgroup and Email) used for the evaluations, the state-of-the-art algorithms considered in the study, and the evaluation metric used to measure performance.

A. Setup of the Class Ratios of the Source and Target Domains

In practice, the true class distribution of the target domain is usually unknown. Thus, we begin with an investigation of the effects of various class ratios of the target data on the prediction accuracies. To carry out the investigation, the term Target Positive Class Ratio (TPCR) is introduced for the purpose of analyzing the impact of various class ratios in the target domain on the diverse learning algorithms considered. For a binary problem (K = 2), the TPCR defines the proportion of positive samples in the target domain. For example, in a set of 1000 target samples, a TPCR of 0.3 implies that 300 samples are positive and the remaining are negative. In the experimental study, three TPCR settings are investigated. In the case of the K = 4 ordinal regression problem, the samples with labels belonging to the first half of the K classes are treated as positive and the rest of the samples are treated as negative. In addition, each class in its respective positive/negative group has an equal number of samples. For example, a 4-class problem with 1000 samples under the setting of TPCR = 0.3 implies that classes 1 and 2 have 150 samples each, while classes 3 and 4 have 350 samples each.
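The class-count arithmetic in the example above can be restated as a small helper; the function name and the equal-split convention used here are illustrative assumptions, not part of the authors' protocol.

```python
def class_counts(num_samples, num_classes, tpcr):
    """Split samples so that the first half of the ordinal classes (the 'positive'
    group) holds a fraction `tpcr` of the data, with equal counts inside each group.

    Example: class_counts(1000, 4, 0.3) -> [150, 150, 350, 350]
             class_counts(1000, 2, 0.3) -> [300, 700]
    """
    half = num_classes // 2
    pos_total = int(round(num_samples * tpcr))
    neg_total = num_samples - pos_total
    return ([pos_total // half] * half
            + [neg_total // (num_classes - half)] * (num_classes - half))
```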

Besides investigating the various class distributions of the target domain, we also study the various class ratios of the source domains, since source sample selection bias is likely to be observed when the trained classifier exhibits properties that steer towards the distribution of the source domain. Specifically, an imbalanced class ratio between the source and target domains is expected to aggravate the degree of source sample selection bias [49, 50]. Hence, in our study, the term Source Positive Class Ratio (SPCR) is introduced and defined to denote the proportion of positive samples in the source domain. In the experimental study, the robustness of each state-of-the-art algorithm is investigated over several SPCR settings, covering the different class ratios between the source and target domains.

B. Multi-Domain Sentiment, Newsgroup and Email Datasets

On the Sentiment dataset, we consider the cases of K = 2 and K = 4. The dataset was prepared as reported in [14]. It comprises four categories of product reviews: Book, DVDs, Electronics, and Kitchen appliances from Amazon.com. For each task, one category is posed as the target domain while the rest serve as related source domains. Each review is marked with a five-star rating scale, where a higher star rating implies better feedback. Note that in [14], the 3-star rating data have been removed to avoid ambiguities in the binary classification. In the context of the binary (K = 2) problem, the negative samples are made up of 1-star and 2-star ratings, whereas the rest of the ratings form the positive samples. Hence, the task is to categorize the target testing data into positive and negative reviews. In the context of the K = 4 problem, the task is to categorize the target testing data into star-ratings 1, 2, 4 and 5. In each of the tasks for both the K = 2 and K = 4 problems, 2000 samples are randomly selected from each source domain to form the labeled data, and 500 samples from the target domain serve as unlabeled data.

TABLE I: Grouping of source and target domains in the Newsgroup dataset
Domain   | Category comp       | Category rec   | Category sci
Source 1 | windows.x           | motorcycles    | electronics
Source 2 | sys.ibm.pc.hardware | sport.baseball | med
Source 3 | sys.mac.hardware    | sport.hockey   | space
Target   | graphics            | autos          | crypt

On the Newsgroup and Email datasets, K = 2 is considered. The Newsgroup dataset consists of three main categories: comp, rec, and sci. Each main category is then separated into Source 1, Source 2, Source 3 and Target (see Table I), resulting in three tasks: comp vs. rec, comp vs. sci and rec vs. sci. In particular, each task is to categorize the target testing data into their respective categories. The Email dataset considered here is from the ECML/PKDD 2006 discovery challenge. The source and target domains consist of spam and non-spam emails from user and public inboxes, respectively. The task is then defined as categorizing the target testing data into spam and non-spam emails. In each of the tasks, 1000 samples are randomly selected from each source domain to form the labeled data, while 500 samples from the target domain serve as unlabeled data. Since the problems of interest are text datasets, they are preprocessed with single and bi-gram terms extracted, stopwords removed, and stemming and normalization of each feature performed. Consequently, each feature of a sample is represented by its respective tf-idf value. Further, the linear kernel is employed in the experimental study.
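The text-preprocessing pipeline described above can be approximated with standard tooling; the snippet below is only a rough equivalent using scikit-learn (the authors' exact preprocessing, including the stemmer and normalization scheme, is not specified here), so treat the parameter choices as assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Unigram and bigram terms, English stopword removal, tf-idf weighting.
# Note: stemming is not built into TfidfVectorizer; a stemmer (e.g. from NLTK)
# would have to be applied to the documents beforehand.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")

docs = ["great product, works well", "terrible build quality"]  # toy examples
X = vectorizer.fit_transform(docs)        # sparse tf-idf features
clf = LinearSVC().fit(X, [1, 0])          # linear kernel via a linear SVM
```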
C. State-of-the-art Algorithms Considered

In the present study, several state-of-the-art algorithms are investigated under the diverse TPCR and SPCR settings considered, on datasets involving three source domains (Sentiment and Newsgroup datasets) or one source domain (Email dataset) and a target domain:

1) 1S-SVM: Each source domain is trained using the SVM (the ordinal SVM code used is publicly available; see htlin/program/libsvm/#ordinal), and the lowest balanced absolute error among the classifiers is reported.

2) 2S-SVM: Each unique pair of source domains is trained using the SVM, and the lowest balanced absolute error among the classifiers is reported.

3) MCC: Multiple Convex Combination denotes a representative DA method that linearly combines all source classifiers trained based on the SVM [20]. Since the present study involves three source domains, MCC is equivalent to a 3S-SVM.

4) LG-MMC: Label Generating Maximum Margin Clustering [31] maximizes the margin separating two opposite clusters of the target unlabeled data without the use of any label information available in the source domains. Since LG-MMC does not use any class label information, we assume the class labels assigned to the respective clusters to be those true class labels that give the lowest balanced absolute error. Since LG-MMC does not consider the ordinal constraint, it is only used on binary problems (i.e., K = 2).

5) KMM: Kernel Mean Matching addresses the marginal distribution differences between a single source domain and a target domain by re-weighting each of the source samples in the Reproducing Kernel Hilbert Space (RKHS) such that the Maximum Mean Discrepancy (MMD) criterion defined on the source and target domains [7] is minimized (the weights of the source samples are learned using quadratic programming, as stated in [7]). A weighted SVM is then trained on the source domain using the derived weight of each sample. One KMM is trained for each source domain and the lowest balanced absolute error among the classifiers is reported.

6) TCA: Transfer Component Analysis assumes there exists some feature map under which the predictive distributions of a single source domain and a target domain are similar, i.e., P^S(y|x) ≈ P^T(y|x), where the superscripts S and T refer to the source and target domains, respectively.
Hence, TCA learns a set of transfer components in the RKHS based on the MMD criterion, and subsequently the SVM is trained on the source domain in this RKHS [18]. One TCA is trained for each source domain and the lowest balanced absolute error among the classifiers is reported.

7) TOLL: Transfer Ordinal Label Learning learns the labels of the target unlabeled data by maximizing the margin of separation in the target data, based on the label space spanned by a linear combination of source classifiers, as described in Figure 1.

The parameters of all methods are configured by means of the k-fold cross-source-domain validation scheme suggested in [51], which denotes an extension of the standard k-fold cross validation for DA learning. Here, k is the number of source domains, i.e., k = m. Specifically, each partition represents a source domain in k-fold cross-source-domain validation. In addition, β is fixed to a common value in LG-MMC and TOLL (as used in Algorithm 1).

D. Evaluated Performance Metric

For ordinal problems, the absolute error is commonly used as the criterion for defining accuracy; it gives the absolute difference between the predicted label and the ground-truth label. In particular, the smaller the absolute error, the nearer the predicted labels are to the ground-truth labels. However, in cases where the source and target class distributions differ, the balanced error can be considered [52, 53]. Taking this cue, the balanced absolute error is considered as the evaluation criterion for the ordinal regression problem studied here, and is defined as follows:

(1/K) Σ_{k=1}^{K} ( Σ_{i=1}^{u} |ŷ_i − y_i| I[y_i = k] / Σ_{i=1}^{u} I[y_i = k] ),   (24)

where |·| denotes the absolute value. For each of the tasks, 10 independent runs are conducted and the average results are reported.

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, we first perform a study on the validity of the cluster assumption used in TOLL, before proceeding with the discussion and analysis of the experimental results for the algorithms considered. Lastly, an experimental study is carried out to investigate the time complexities of the compared state-of-the-art methods.

A. Case Study on the Cluster Assumption in TOLL

In this subsection, we analyze the validity of the cluster assumption criterion employed in TOLL. Recall that TOLL begins with a computation of the source classifiers for generating the set of potential class labels y_z^s for the target unlabeled data, as outlined in Algorithm 1. We plot the margin of separation 2/‖w‖ for each y_z^s, obtained from solving the inner minimization of (5), with Kitchen Appliances serving as the target domain. Subfigures 2(a), 2(b) and 2(c) depict the plots of y_z^s generated by 1S-SVM as the source classifiers trained on the source domains Book, DVDs, and Electronics, respectively. Subfigures 2(d), 2(e) and 2(f) are plots of y_z^s generated by 2S-SVM as source classifiers from two combined source domains: Book and DVDs, Book and Electronics, and DVDs and Electronics, respectively. Then, Subfigure 2(g) depicts the plot of y_z^s generated by 3S-SVM (MCC) as the source classifier on all the source domains. In particular, Figure 2 shows the plots of y_z^s for a particular run using Kitchen Appliances as the target domain with K = 4 under fixed TPCR and SPCR settings. The line in each of subfigures 2(a) to 2(g) regresses the linear trend of the plots of y_z^s.
From the slope of the lines, it is indicative that an increasing margin (2/‖w‖) is associated with a general decrease in the balanced absolute error. These results imply that an appropriate choice of source classifiers based on the large target margin criterion can minimize the balanced absolute error in the target domain. Further, the cluster assumption made in TOLL to minimize the target generalization absolute error (Theorem 4) by means of maximizing the target margin is valid. In addition, although the plots of y_z^s in all the subfigures of Figure 2 share a similar range of margin values, their balanced absolute errors can be observed to differ considerably. This highlights that when no a priori knowledge on choosing the most suitable source domain is available, a source-domain ensemble strategy, as proposed in TOLL, is important for robust prediction accuracy.

B. Experimental Results and Discussion on the Sentiment Dataset

Figures 3 and 4 summarize the balanced absolute error of the target unlabeled data obtained on the Sentiment prediction dataset for K = 2 and K = 4, respectively. The three subfigures on the left denote the results obtained on the target domain for one TPCR setting, whereas the remaining three subfigures on the right present the results for the other TPCR setting considered. Subfigures 3(a), 3(d), 4(a) and 4(d) summarize the balanced absolute error of the target domain on the DVDs dataset for varying degrees of SPCR in the source domain. On the other hand, subfigures 3(b,e) and 4(b,e), and subfigures 3(c,f) and 4(c,f), display the results for the cases where the Electronics and Kitchen Appliances datasets serve as the target domain of interest, respectively. For the sake of conciseness, the experimental results with the Book dataset as the target domain are omitted from this paper, since trends similar to the other datasets studied have been observed for the considered algorithms. In addition, since the results for the complementary TPCR setting are symmetrical to those reported, the remaining target domains for that setting are also omitted.

As observed from Figure 3, LG-MMC exhibited the worst balanced absolute error among all the methods under investigation. This indicates that an unsupervised approach based on maximal margin separation of the unlabeled data, without any use of label information, is less effective than DA methods, owing to the abundance of labeled data from other related source domains that can be appropriately used to complement class predictions on the target unlabeled data. The results obtained thus confirm the effectiveness of DA methods on Sentiment data in the absence of target label information. (Note that LG-MMC does not appear in Figure 4 (K = 4) since it does not consider ordinal class labels.)
Fig. 2. Margin vs. absolute error for different source domains, with Kitchen Appliances as the target domain of interest, for K = 4 under fixed TPCR and SPCR settings. Panels (a)-(g) correspond to the source combinations Src B, Src D, Src E, Src B+D, Src B+E, Src D+E and Src B+D+E, where B, D and E symbolize Book, DVDs and Electronics, respectively. The points in each subfigure denote the class labels y_z^s obtained by Algorithm 1, and each y_z^s has a respective margin 2/‖w‖ obtained from solving the inner minimization of (5). The x-axis represents the margin, while the y-axis denotes the balanced absolute error. Please refer to the text for more details.

It can also be observed from the results in Figures 3 and 4 that 1S-SVM underperforms MCC (i.e., 3S-SVM) and 2S-SVM in general. Nevertheless, when source domains are used at equal weights (i.e., 2S-SVM and MCC), the results in Figures 3 and 4 show significant degradations, due to the imbalanced class ratios between the source and target domains. Note that when the SPCR setting approaches either extreme of the range considered in the experimental study, the performance of most methods, except TOLL, can be observed to degrade significantly. Since both KMM and TCA operate by minimizing the marginal distribution differences between the target and source domains according to the MMD criterion [7], the degradations in performance indicate that the necessary assumption of similar predictive distributions between source and target domains made by KMM and TCA does not hold on the Sentiment data. At the same time, these results also indicate that imbalanced class ratios between the source and target domains do lead to source sample selection bias.

From Figure 4, KMM and TCA are noted to have attained lower performance than the others in general. The former method operates by re-weighting the source labeled data so as to match the marginal distribution of the target data, while assuming a common class distribution shared by the source and target domains; hence poor prediction results are observed when their class distributions are dissimilar. The latter method remaps the kernel space so as to minimize the distance between the source and target domains, such that samples of star-ratings 1 and 2 are reconfigured to be closer together, and similarly for samples of star-ratings 4 and 5. This explains why the performance obtained by TCA in Figure 4 is noted to be poor on ordinal problems, while exhibiting rewarding results on the binary sentiment problem, as observed in Figure 3.

While the performances of the DA methods are observed to suffer from source sample selection bias due to the differing class ratios between the source and target domains, TOLL is observed to perform robustly across the range of SPCR and TPCR settings considered. TOLL also attained the lowest balanced absolute error, in relation to all the other methods, for the extreme SPCR configurations, as observed from all the subfigures. This implies that TOLL is capable of choosing a robust linear combination of source label vectors that represent the label space of the target unlabeled data.
It maximizes the margin of separation solely based on the target unlabeled data in the target label space that is spanned by label vectors generated from multiple independent source classifiers (i.e., the bias parameter of each source classifier is projected onto the target unlabeled data). It is also worth highlighting that while TCA outperformed all other algorithms at certain TPCR and SPCR settings on the Sentiment (K = 2) dataset (see Figures 3(a,b,c)), the reported TCA results are chosen as the best among three results, each of which is obtained by applying TCA to a different source domain. The three results on different source domains, each trained using TCA with DVDs as the target domain, are reported in Figure 5. Furthermore, the balanced absolute error of each source domain trained using the SVM, denoted as 1S-SVM, is also depicted in Figure 5. In practice, it is non-trivial to determine in advance which source domain is the most suitable for the target domain, especially in the absence of prior knowledge on the target domain. TOLL thus fills this gap by providing an ensemble of suitable source classifiers to attain improved predictive performance in the target domain of interest. Therefore, in general, the results in Figure 5 show that TCA performed much worse than TOLL across most of the SPCR settings.
Fig. 3. Balanced absolute error for K = 2 on the Sentiment dataset (target domains DVDs, Electronics and Kitchen Appliances; methods 1S-SVM, 2S-SVM, MCC, LG-MMC, KMM, TCA and TOLL), where the left section corresponds to one TPCR setting of the target domain and the right section to the other. The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

Fig. 4. Balanced absolute error for K = 4 on the Sentiment dataset (target domains DVDs, Electronics and Kitchen Appliances; methods 1S-SVM, 2S-SVM, MCC, KMM, TCA and TOLL), where the left section corresponds to one TPCR setting of the target domain and the right section to the other. The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

C. Newsgroup and Email Experimental Result Discussions

The results for the Newsgroup dataset are reported in Figure 6. We can observe that LG-MMC achieved decent performance on the Newsgroup data. Particularly, LG-MMC reported an improved balanced absolute error over 1S-SVM, 2S-SVM, MCC and KMM at several SPCR settings in most of the subfigures illustrated. This implies that solely learning from target unlabeled data can sometimes be more beneficial than enlisting additional labeled samples from other source domains, especially when the target data is well separated (cluster assumption). 1S-SVM also operates by maximizing the margin of separation, but its training is concentrated on the source domain, where source sample selection bias creeps in. KMM improves on the results of 1S-SVM by means of the Maximum Mean Discrepancy criterion but still fares poorer than LG-MMC. On the other hand, TCA and TOLL achieved significantly lower balanced absolute error than LG-MMC, 1S-SVM and KMM. Overall, TOLL emerged as superior to all other methods in all experimental settings considered, except on the rec vs. sci task at one of the TPCR settings. The details of the rec vs. sci task at this setting are depicted in Figure 7, where it is observed that TCA performs worse than TOLL if source 2 or source 3 is considered in the training process for classifying the target unlabeled data.
Fig. 5. DVDs as the target domain in the Sentiment experiments for K = 2: comparisons among 1S-SVM, TCA and TOLL. 1S-SVM-X or TCA-X denotes that source domain X is used to classify the DVDs test data.

Therefore, the selection of appropriate source domains in TCA is an essential task that bears great impact on its effectiveness. However, in practice, it is difficult to determine the most appropriate source domain for TCA beforehand. In general, the results in Figure 6 show that TOLL displayed high robustness and superior prediction accuracy throughout the entire range of SPCR and TPCR settings considered.

The results on the Email dataset are reported in Figure 8. MCC, KMM and TCA reported their best accuracies at one extreme of the SPCR range. However, the accuracies of MCC, KMM and TCA exhibited declining trends as the SPCR approached the other extreme. This observation is due to the task of detecting spam emails (positive samples) being easier than identifying non-spam emails (negative samples). On the other hand, since LG-MMC learns only from target unlabeled data, it does not suffer from source sample selection bias, as observed in the figure when far fewer negative samples than positive samples are available (i.e., at high SPCR). Nevertheless, LG-MMC is observed to exhibit poor accuracy across the entire range of SPCR settings. It is worth mentioning that TCA reported the best accuracy among all methods at one SPCR setting. Therefore, minimizing the marginal distribution differences between the target and source domains, according to the MMD criterion, through finding the transfer components in the RKHS does help. Nevertheless, the approach still suffers performance degradation when the class distributions of the source and target domains differ. On the other hand, TOLL achieved better performance than all the methods considered over a wide range of SPCR values and displayed robust results across the entire range of SPCR settings. Last but not least, we also performed a Wilcoxon signed-ranks test [54] on all the results in Figures 2, 3, 4, 6 and 8, and it indicates with 99% confidence that TOLL is significantly better than the compared methods.

Fig. 6. Newsgroup experimental results (tasks comp vs. rec, comp vs. sci and rec vs. sci; methods 1S-SVM, 2S-SVM, MCC, LG-MMC, KMM, TCA and TOLL), where the left section corresponds to one TPCR setting of the target domain and the right section to the other. The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

D. Comparison of the Time Complexities of the State-of-the-art Methods

In what follows, we discuss the theoretical time complexities of the compared methods. 1S-SVM, 2S-SVM and MCC use the SVM as the classifier; hence they exhibit a time complexity of O(((K − 1)n)^{2.3}), which is assumed as the empirical complexity of SVM training, where n is the number of source labeled data. KMM is solved using quadratic programming with a time complexity of O(n^3). TCA, on the other hand, is solved with an eigen-decomposition and has a time complexity of O((n + u)^3).
For TOLL, the computational complexity is O(TJ((K − 1)u)^{2.3}), where K is the number of classes, u is the number of target unlabeled data, and J and T are the numbers of iterations incurred by the cutting plane method and MKL, respectively. Thus, TOLL takes a factor of JT(u/n)^{2.3}, JT((K − 1)u)^{2.3}/n^3 and JT((K − 1)u)^{2.3}/(n + u)^3 over MCC, KMM and TCA, respectively. Hence, when the product of J, T, K and u is much greater than n, TOLL will display a higher computational complexity than the other methods. On the other hand, when an abundance of source data is available such that n ≫ u, the proposed TOLL is faster.
TABLE II: Training time (seconds) of various methods on the Kitchen Appliances (K = 4) dataset. Methods compared: 1S-SVM, 2S-SVM, MCC, KMM, TCA and TOLL.

To verify the theoretical analysis, an experimental study is carried out to investigate the training times of the methods with Kitchen Appliances (K = 4) as the target domain. Note that the training size of both 1S-SVM and KMM is 2000, while the training sizes of 2S-SVM, MCC, TCA and TOLL are 4000, 6000, 2500 and 500, respectively. The training times (in seconds) of the aforementioned methods are detailed in Table II. Since 1S-SVM, 2S-SVM and MCC share the same computational complexity, the method with the most training samples is expected to have the longest training time, as observed in Table II. Furthermore, as the amount of source labeled data increases, which leads to a smaller ratio of target unlabeled data to source labeled data, the ratio of the training time of TOLL to that of the other methods is also expected to become smaller. Accordingly, the observations from Table II show that the ratio of the training time of TOLL to MCC is smaller than that of TOLL to 2S-SVM and of TOLL to 1S-SVM. It is also worth mentioning that the training times of TCA, KMM and 1S-SVM are consistent with the aforementioned theoretical computational complexities of those methods. Last but not least, TOLL took the longest time to train a classifier. Nevertheless, TOLL is observed to be robust across the tasks depicted in Figures 3, 4, 6 and 8.

Fig. 7. Rec vs. sci task in the Newsgroup experiments: comparisons among 1S-SVM, TCA and TOLL. 1S-SVM-X or TCA-X denotes that source domain X (see Table I) is used to classify the target test data. The y-axis is the balanced absolute error.

Fig. 8. Email (spam) experimental results for the two TPCR settings considered (methods MCC, LG-MMC, KMM, TCA and TOLL). The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

VI. CONCLUSION

A core challenge of transfer learning in attaining a reliable classifier from relevant source domains is the induction of source sample selection bias, such that the eventual trained classifier often steers towards the distribution of the source domain. This bias is deemed to become more severe on data involving multiple classes. Taking this cue, we have proposed a Transfer Ordinal Label Learning (TOLL) paradigm that predicts the ordinal labels of target unlabeled data by spanning the feasible solution space with ordinal classifiers from multiple relevant source domains. In contrast to previous works, the maximum margins between consecutive ordinal classes are employed as the criterion for the selection and/or fusion of appropriate source ordinal classifiers when designing the target classifier. In this manner, the proposed approach learns a target ordinal classifier that involves only the kernel expansion of the target data. Through comprehensive experimental studies, TOLL is shown to display superiority and robustness across the entire range of imbalanced source and target class ratio settings when pitted against several state-of-the-art methods, which suffered significantly in prediction accuracy. Last but not least, TOLL is significantly better than all the compared methods over all the datasets considered in the experimental study, based on the Wilcoxon signed-ranks test [54] with 99% confidence.
REFERENCES

[1] P. Wu and T. G. Dietterich, Improving SVM Accuracy by Training on Auxiliary Data Sources, in ICML, Banff, Alberta, Canada, 2004.
[2] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, J. Statist. Plann. Inference, vol. 90, no. 2.
[3] M. Sugiyama and K.-R. Müller, Input-dependent estimation of generalization error under covariate shift, Statist. & Decis., vol. 23, no. 4.
[4] A. J. Storkey and M. Sugiyama, Mixture regression for covariate shift, in NIPS, British Columbia, Canada, 2006.
[5] S. Bickel, M. Brückner, and T. Scheffer, Discriminative Learning Under Covariate Shift, JMLR, vol. 10.
[6] X. Liao, Y. Xue, and L. Carin, Logistic regression with an auxiliary data source, in ICML, Bonn, Germany, 2005.
[7] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, Correcting Sample Selection Bias by Unlabeled Data, in NIPS, Vancouver, British Columbia, Canada, 2006.
[8] S. Bickel, M. Brückner, and T. Scheffer, Discriminative learning for differing training and test distributions, in ICML, Corvallis, Oregon, USA, 2007.
[9] M. Sugiyama, M. Krauledat, and K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, JMLR, vol. 8.
[10] J. Jiang and C. Zhai, Instance weighting for Domain Adaptation in NLP, in ACL, Prague, Czech Republic, 2007.
[11] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, A kernel method for the two-sample-problem, in NIPS, Vancouver, B.C., Canada, 2007.
[12] M. Sugiyama, T. Suzuki, and T. Kanamori, Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation, Ann. Inst. Statist. Math., 2011.
[13] J. Blitzer, R. McDonald, and F. Pereira, Domain Adaptation with Structural Correspondence Learning, in EMNLP, Sydney, Australia, 2006.
[14] J. Blitzer, M. Dredze, and F. Pereira, Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, in ACL, Prague, Czech Republic, 2007.
[15] H. Daumé III, Frustratingly easy domain adaptation, in ACL, Prague, Czech Republic, 2007.

[16] W. Dai, O. Jin, G.-R. Xue, Q. Yang, and Y. Yu, Eigentransfer: a unified framework for transfer learning, in ICML, Montreal, Quebec, Canada, 2009.
[17] E. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, D. Turaga, and O. Verscheure, Cross domain distribution adaptation via kernel mapping, in KDD, Paris, France, 2009.
[18] S. J. Pan, I. Tsang, J. Kwok, and Q. Yang, Domain Adaptation via Transfer Component Analysis, TNN, vol. 22, no. 2, 2011.
[19] S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen, Cross-domain sentiment classification via spectral feature alignment, in WWW, Raleigh, North Carolina, USA, 2010.
[20] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch, An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis, in NIPS, Vancouver, British Columbia, Canada, 2009.
[21] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, On combining classifiers, TPAMI, vol. 20, no. 3, 1998.
[22] X. Shi, Q. Liu, W. Fan, Q. Yang, and P. S. Yu, Predictive Modeling with Heterogeneous Sources, in SDM, Columbus, Ohio, USA, 2010.
[23] L. Duan, D. Xu, and I. W.-H. Tsang, Domain adaptation from multiple sources: A domain-dependent regularization approach, TNNLS, vol. 23, no. 3, 2012.
[24] R. Herbrich, T. Graepel, and K. Obermayer, Support vector learning for ordinal regression, in ICANN, Edinburgh, 1999.
[25] W. Chu and S. S. Keerthi, New approaches to support vector ordinal regression, in ICML, Bonn, Germany, 2005.
[26] J. J. Heckman, Sample selection bias as a specification error, Econometrica, vol. 47, no. 1, pp. 153–161, 1979.
[27] F. Vella, Estimating models with sample selection bias: A survey, J. Hum. Res., vol. 33, no. 1, 1998.
[28] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. The MIT Press.
[29] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, Maximum margin clustering, in NIPS, Vancouver, British Columbia, Canada, 2005.
[30] W.-S. Zheng, S. Gong, and T. Xiang, Quantifying and transferring contextual information in object detection, TPAMI, vol. 34, no. 4, 2012.
[31] Y.-F. Li, I. W. Tsang, J. T. Kwok, and Z.-H. Zhou, Tighter and convex maximum margin clustering, in AISTATS, Clearwater Beach, Florida, USA, 2009.
[32] J. J. Lim, R. Salakhutdinov, and A. Torralba, Transfer learning by borrowing examples for multiclass object detection, in NIPS, Granada, Spain, 2011.
[33] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, Training hierarchical feedforward visual recognition models using transfer learning from pseudo-tasks, in ECCV, Marseille, France, 2008.
[34] A. Farhadi, D. A. Forsyth, and R. White, Transfer learning in sign language, in CVPR, Minneapolis, Minnesota, USA, 2007, pp. 1–8.
[35] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, in ICML, Bled, Slovenia, 1999.
[36] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, Transductive ordinal regression, TNNLS, vol. 23, no. 7, 2012.
[37] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, Healing sample selection bias by source classifier selection, in ICDM, Vancouver, BC, Canada, 2011.
[38] L. Li and H.-T. Lin, Ordinal regression by extended binary classification, in NIPS, Vancouver, British Columbia, Canada, 2006.
[39] J. S. Cardoso and J. F. Pinto da Costa, Learning to classify ordinal data: The data replication method, JMLR, vol. 8.
[40] P. A. Gutiérrez, M. Pérez-Ortiz, F. Fernández-Navarro, J. Sánchez-Monedero, and C. Hervás-Martínez, An experimental study of different ordinal regression methods and measures, in HAIS (2), 2012.
[41] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning.
[42] S. Boyd and L. Vandenberghe, Convex Optimization.
[43] S.-J. Kim and S. Boyd, A Minimax Theorem with Applications to Machine Learning, Signal Processing, and Finance, SIAM J. on Optimization, vol. 19, no. 3.
[44] G. Lanckriet, N. Cristianini, P. Bartlett, and L. E. Ghaoui, Learning the kernel matrix with semidefinite programming, JMLR, vol. 5.
[45] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, Simple and efficient multiple kernel learning by group lasso, in ICML, Haifa, Israel, 2010.
[46] J. E. Kelley, Jr., The cutting-plane method for solving convex programs, Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 4, 1960.
[47] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, Learning bounds for domain adaptation, in NIPS.
[48] H.-T. Lin and L. Li, Reduction from cost-sensitive ordinal ranking to weighted binary classification, Neural Computation, vol. 24, no. 5, 2012.
[49] C.-W. Seah, I. W. Tsang, Y.-S. Ong, and K.-K. Lee, Predictive Distribution Matching SVM for Multi-domain Learning, in ECML/PKDD, Barcelona, Spain, 2010.
[50] L. Bruzzone and M. Marconcini, Domain Adaptation Problems: A DASVM Classification Technique and a Circular Validation Strategy, TPAMI, vol. 32, no. 5, 2010.
[51] J. Jiang and C. Zhai, A two-stage approach to domain adaptation for statistical classifiers, in CIKM, Lisbon, Portugal, 2007.
[52] N. V. Chawla, N. Japkowicz, and A. Kotcz, Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter, vol. 6, pp. 1–6.
[53] M. Sokolova, N. Japkowicz, and S. Szpakowicz, Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation, Artificial Intelligence, vol. 4304.
[54] J. Demšar, Statistical comparisons of classifiers over multiple data sets, JMLR, vol. 7, pp. 1–30, 2006.

Chun-Wei Seah received the B.Eng. (first-class honors) and Ph.D. degrees in computer science from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2009 and 2013, respectively. He is currently a senior member of technical staff at the Defence Science Organisation (DSO) National Laboratories, Singapore. His current research interests include transductive learning, transfer learning, rank learning and sentiment prediction. Mr. Seah is a recipient of the Nanyang President's Graduate Scholarship.

Ivor W. Tsang received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Kowloon, Hong Kong. He is currently an Assistant Professor with the School of Computer Engineering, Nanyang Technological University (NTU), Singapore, and the Deputy Director of the Center for Computational Intelligence, NTU.
Dr. Tsang received the prestigious IEEE TRANSACTIONS ON NEURAL NETWORKS Outstanding 2004 Paper Award in 2006 and the 2008 National Natural Science Award (Class II), China. His co-authored papers also received the Best Student Paper Award at the 23rd IEEE Conference on Computer Vision and Pattern Recognition in 2010, the Best Paper Award at the 23rd IEEE International Conference on Tools with Artificial Intelligence in 2011, the Best Student Paper Award from PREMIA, Singapore, in 2012, and the Best Paper Award from the IEEE Hong Kong Chapter of Signal Processing Postgraduate Forum. He was also conferred the Microsoft Fellowship.

Yew-Soon Ong received the B.S. and M.S. degrees in electrical and electronics engineering from Nanyang Technological University (NTU), Singapore, in 1998 and 1999, respectively. He completed the Ph.D. degree on artificial intelligence in complex design at the Computational Engineering and Design Center, University of Southampton, U.K. He is currently an Associate Professor and the Director of the Center for Computational Intelligence at the School of Computer Engineering, NTU. Dr. Ong is the founding Technical Editor-in-Chief of the Memetic Computing Journal, Chief Editor of the Springer book series on studies in adaptation, learning, and optimization, and an Associate Editor of the IEEE Computational Intelligence Magazine, the IEEE Transactions on Systems, Man and Cybernetics Part B, Soft Computing, Information Sciences, the International Journal of System Sciences and many others. He also chairs the IEEE Computational Intelligence Society Emergent Technology Technical Committee and has served as a Guest Editor for several journals. His research interests in computational intelligence span memetic computing, evolutionary design, machine learning, agent-based systems and cloud computing.
