Transfer Ordinal Label Learning

Chun-Wei Seah, Ivor W. Tsang, Yew-Soon Ong

Abstract: Designing a classifier in the absence of labeled data is becoming a common encounter, as the acquisition of informative labels is often difficult or expensive, particularly on new, uncharted target domains. The feasibility of attaining a reliable classifier for the task of interest has been explored in transfer learning, where label information from relevant source domains is considered for complementing the design process. The core challenge arising from such endeavors, however, is the induction of source sample selection bias, such that the trained classifier has the tendency of steering towards the distribution of the source domain. This bias is deemed to become more severe on data involving multiple classes. Taking this cue, our interest in this paper is to address such a challenge in the target domain, where ordinal labeled data are unavailable. In contrast to previous works, we propose a Transfer Ordinal Label Learning (TOLL) paradigm to predict the ordinal labels of target unlabeled data by spanning the feasible solution space with an ensemble of ordinal classifiers from multiple relevant source domains. Specifically, the maximum margin criterion is considered here for the construction of the target classifier from an ensemble of source ordinal classifiers. Theoretical analysis and extensive empirical studies on real-world datasets are presented to study the benefits of the proposed method.

Index Terms: Transfer Learning, Domain Adaptation, Ordinal Regression, Source Sample Selection Bias, Classifier Selection

I. INTRODUCTION

To date, many practical realizations of machine intelligence are making their way as important tools that assist humans in their decision-making process. A motivating example is sentiment rating prediction on user reviews as a tool for crafting novel marketing strategies for newly launched products (referred to as the target domain). Each user review can be categorized into different star-ratings (often represented as ordinal labels in machine classification), where a higher star-rating indicates better feedback on the product. In practice, most newly launched products have many user comments posted on the Internet. However, it is usually the case that few of such comments are readily tagged with sentiment star-rating labels. To address the absence of such label information, the field of Domain Adaptation (DA) learning has embarked on feasibility studies of classifiers for new target domains using the available label information of other related source domains. The initial work on DA, as proposed in [1], presented a study involving the use of a single related source domain that shares a common joint distribution with the target domain of interest.

(Manuscript received December 3, 2012; revised March 2, 2013 and June 6, 2013; accepted June 10, 2013. This research is partially supported by the Multi-plAtform Game Innovation Centre (MAGIC) at Nanyang Technological University. MAGIC is funded by the Interactive Digital Media Programme Office (IDMPO) hosted by the Media Development Authority of Singapore. The authors are with the School of Computer Engineering, Nanyang Technological University, Singapore; SChunWei@dso.edu.sg, {IvorTsang,asYSOng}@ntu.edu.sg.)

Subsequent works have moved on to relax the strict common joint distribution assumption.
In particular, dissimilarity in the marginal distributions among domains has been established as covariate shift [2-5]. To date, a common remedy to the covariate shift issue is instance re-weighting [6-10], where the weight of each source sample is defined according to the density ratio of the target marginal distribution P_t(x) to the source marginal distribution P_s(x), i.e., P_t(x)/P_s(x). In this manner, the dissimilarities between the source and target domains are modeled with different marginal distributions P(x), while the similarity of the predictive distributions P(y|x) across the different domains is preserved. The Kernel Mean Matching (KMM) method [7], for instance, first estimates the weight of each source sample by minimizing the Maximum Mean Discrepancy (MMD) [11] between the source labeled samples and the target unlabeled samples. The re-weighted source samples are subsequently used for training the target classifier. More recently, a unified framework of density-ratio estimation based on the Bregman divergence has been proposed [12], which includes KMM as one of its variants. Another popular scheme in the DA field is to seek an appropriate feature representation of the source domain that corresponds well to the feature space of the target domain [13-19]. For instance, by minimizing the MMD between the source and target samples, Transfer Component Analysis (TCA) [18] identifies a suitable latent space spanned by some basis vectors, referred to as the transfer components.

In recent years, many advancing DA methods have broadened their scope to consider leveraging the label information from multiple relevant source domains. For instance, by treating every source domain equally, Multiple Convex Combinations (MCC) formulates the target classifier as a fusion of multiple Support Vector Machine (SVM) classifiers that are learned from the individual relevant source domains [20, 21]. However, it is worth noting that a simple and direct compilation of all data in the source domains to complement the target learning task can lead to adverse outcomes [22], especially when the classifier learned from the source data fails to be discriminative on the target data. As such, an extension of MCC, labeled the Domain Adaptation Machine (DAM) in [23], was subsequently proposed, where prior knowledge on the source and target domains is incorporated to define the importance of each source classifier. More importantly, it is worth highlighting that, in general, all DA methods train the target classifier by minimizing the empirical risk defined only on the source data or its weighted samples. With such a design process, the classifier is likely to exhibit properties that are steered towards the distribution of the source domain, and this inevitably induces biases in the resultant prediction, thus potentially leading to poor accuracy in the prediction of the unseen target data.
Particularly, we consider the study of transfer ordinal label learning, since the bias is expected to be more severe when multiple classes are involved [24, 25]. We refer to this phenomenon as the source sample selection bias. (Note that sample selection bias is well known in econometrics [26, 27] and in dataset shift [28], and covariate shift is considered one of its variants [9].) To alleviate the source sample selection bias, it is generally advisable to directly minimize the expected risk functional defined only on the target data, for example, by leveraging any prior knowledge that may be available on the output label structure of the target domain. An intuitive solution is to group the target unlabeled samples via an unsupervised learning paradigm, subject to some imposed criteria, such as Maximum Margin Clustering (MMC) and Maximum Margin Context described in [29] and [30], respectively. Particularly, MMC maximizes the margin between opposite clusters by considering all possible combinations of labels on the target unlabeled samples. To be specific, MMC optimizes the labels of u unlabeled samples from c^u unique label combinations for a c-class problem. MMC, however, has its limits. By not taking into account class structure, such as the abundance of label information readily available in the related source domains (for instance, ordinal class labels in the context of ordinal regression), it tends to underperform DA methods in general. Besides, the approach may sometimes lead to trivial solutions, such as the case where all samples are grouped under the same class label and hence deemed futile [29, 31].

In this paper, our interest lies in addressing the challenges pertaining to source sample selection bias in the absence of target labeled data. In contrast to existing DA works and MCC, we propose a novel Transfer Ordinal Label Learning approach, or TOLL in short, which imposes the maximum margin criterion on the target unlabeled data in the process of constructing the target classifier from an ensemble of source ordinal classifiers. Here, this paper assumes the source and target domains share the same task (in the event where the source and target domains originate from different tasks, the reader is referred to [32-34]). In the absence of target labeled data, it is reasonable to assume that the feasible solution space of the target ordinal labels can be spanned by a series of source ordinal classifiers. The core contributions of the present paper are summarized as follows:

1) Existing DA methods that seek instance re-weighting or an appropriate feature representation have to date only taken the marginal distribution differences between source and target domains into consideration. Furthermore, it has been established that the effects of source sample selection bias become more severe and challenging in the context of ordinal problems. Despite the advancements of DA approaches, to date none has considered making use of ordinal information in their framework as a means to improve ordinal predictions, mainly because the transfer of output structures from source to target domains is a non-trivial task. To the best of our knowledge, this paper thus presents the first DA work that embarks on an investigation of the issues pertaining to source sample selection bias under the challenging context of ordinal regression. Particularly, TOLL learns the ordinal labels of the target unlabeled data from a convex hull of the ordinal outputs that are predicted by multiple source classifiers, namely the label vectors.

2) We present the generalization absolute error bound for ordinal regression in the target domain. Our analysis shows that, when the target unlabeled data follows the cluster assumption [35, 36] well, a classifier with a large target margin can reduce this error bound. In the experimental study on the sentiment classification application, the results show that an ensemble of source ordinal classifiers with a larger target margin is associated with a smaller testing absolute error in the target domain. This verifies the appropriateness and effectiveness of choosing discriminative source classifiers for ordinal regression in the DA setting.

3) Furthermore, our extensive experimental studies highlight that TOLL emerged as superior to several state-of-the-art DA methods in most of the tasks considered, and is robust to various settings of differing class distribution ratios between the source and target domains.

The rest of this paper is organized as follows: Section II gives the preliminaries and a brief review of ordinal regression. Section III introduces the formulation of TOLL and implementation details. Extensive experiments on the Sentiment, Newsgroup and Email datasets are then carried out in Section IV. The experimental results are analyzed and discussed in Section V. Lastly, the concluding remarks of this paper are drawn in Section VI. A preliminary work on TOLL can be found in [37]; this paper serves as a significant extension, which includes, but is not limited to, the extension to ordinal regression, the derivation of the generalization absolute error bound, and the experimental study on ordinal regression problems.

II. PRELIMINARIES AND REVIEW OF ORDINAL REGRESSION

In this section, the notation used in the present manuscript and a brief review of the extended binary classification model for ordinal regression are presented.

A. Notations

Throughout the rest of this paper, a superscript ⊤ denotes the transpose of a vector or a matrix, ∘ denotes the element-wise product operator, I[·] denotes an indicator function that returns 1 if the predicate holds and 0 otherwise, and sign(·) is a function that returns −1 if the input is negative and +1 otherwise. Moreover, 1 denotes a vector of all ones. Given m source domains and one target domain X_u, which contains u unlabeled (testing) samples x_j ∈ R^p, the task in Domain Adaptation (DA) is to leverage the available labeled data in relevant source domains to predict the class label ŷ_j ∈ {1, 2, ..., K} of each unlabeled sample in the target domain, involving a K ordinal class problem. In addition, a K ordinal class problem is represented by K − 1 ordered thresholds θ_1 ≤ θ_2 ≤ ... ≤ θ_{K−1}, where θ_0 = −∞ and θ_K = ∞. A predictive output h(x) of a sample x that falls between θ_{k−1} and θ_k is thus classified as class k.
Fig. 1. The Transfer Ordinal Label Learning framework. Precomputed classifiers from source domains 1 to m, together with the unlabeled data in the target domain, are used in Step 1 to generate the target label space (see Algorithm 1) and in Step 2 to learn the ordinal labels of the target unlabeled data (see Algorithm 2).

B. Extended Binary Classification Model for Ordinal Regression

In this subsection, we briefly outline an extended binary classification model that has showcased state-of-the-art performance for ordinal regression [38, 39]. (Note that a very similar idea was previously presented in [40].) An ordinal labeled sample (x, y) can be extended to K − 1 binary samples in the SVM algorithm via the following transformation:

x^k = (x, e_k) ∈ R^{p+K−1}, y^k = 2 I[y > k] − 1, for k = 1, 2, ..., K − 1,   (1)

where e_k ∈ R^{K−1} denotes a vector with the kth element being one, while the rest of the elements are zero. As an extended binary sample has a dimension of (p + K − 1), the weight vector w of the SVM is also augmented, to (w, −θ), which is used to give the binary predictive value of x^k as:

f(x^k) = sign((w, −θ)^⊤ x^k) = sign(h(x) − θ_k),   (2)

where h(x) = w^⊤ x. Using (2), the predictive class label of a sample x is then given as follows:

1 + Σ_{k=1}^{K−1} I[f(x^k) = 1].   (3)

III. THE PROPOSED TRANSFER ORDINAL LABEL LEARNING

Figure 1 depicts the learning process of the proposed TOLL framework. Without loss of generality, source classifiers are first trained for each unique combination of source domains. The source classifiers can be trained using any DA method that is readily available. Note that the source classifiers can even be precomputed so as to preserve the interests of a company, such as the privacy and security of customer data. In TOLL, the relevancy and specificity of each source classifier is then learned with respect to the target domain. In particular, TOLL alleviates the presence of any unwanted sample selection bias that may exist by learning the biases of each source classifier, based on prior knowledge available on the output label structure of the target domain. All these source classifiers with different biases are subsequently used to span the target label space (see Sec. III-A). Once the target label space is formed, TOLL proceeds to simultaneously learn the weight of each source classifier and the target classifier for the domain of interest, in a manner where the margin of separation in the target label space is maximized (see Sec. III-B).

A. Generating the Target Label Space from Multiple Sources

Using the complementary labeled data from multiple relevant source domains, an appropriate target classifier can be derived from an ensemble of source classifiers for the purpose of target unlabeled data prediction. In what follows, the procedure to generate the label space for a given set of target unlabeled data, referred to as the target label space, is discussed. An outline of the procedure is summarized in Algorithm 1 and sketched in code below. Given the availability of m source domains, the design process begins with the construction of a classifier in each source domain and also a classifier for each combination of 2, 3, ..., (m − 1) source domains, until all S possible combinations of the m source domains have been explored, i.e., S = Σ_{i=1}^{m} m!/(i!(m − i)!) = 2^m − 1 classifiers are trained. Note that diverse forms of source classifiers can be trained, based on the SVM, Gaussian Processes [41], the Transductive SVM [35] or any other variant of supervised, semi-supervised or DA methods. Without loss of generality, we consider the supervised SVM in the present manuscript. Like most models, each of the S source classifiers includes a bias term b such that the decision boundary is not restricted to intersect the origin. TOLL leverages the biases of the source classifiers to generate label vectors y = [y_1^1, ..., y_1^{K−1}, ..., y_u^1, ..., y_u^{K−1}]^⊤ for the target unlabeled data, where [y_i^1, ..., y_i^{K−1}] ∈ {−1, 1}^{K−1} denotes the extended class labels of the ith sample. Since the source classifiers may be trained from source domains whose distributions differ from the target domain, it is more beneficial to determine the bias b based on the target data. Hence, we propose to define the bias b of the source classifiers in such a way that the label vector y of the target unlabeled data satisfies the following balance constraint:

(u/K)(1 − β) ≤ Σ_{i=1}^{u} I[(Σ_j I[y_i^j = 1]) + 1 = k] ≤ (u/K)(1 + β), ∀k ∈ {1, ..., K},

where β is the hyper-parameter that restricts the imbalance of the class size q_k for the kth class implied by the label vector. This constraint can be implicitly imposed by sorting the classifier's decision outputs on the target unlabeled data, and forms at most Z (on the order of 2βu/K) unique label vectors. Hence, the target label space is spanned by S × Z label vectors. With these S × Z label vectors, the target label space, M, is then defined as follows:

M = {ŷ = Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s y_z^s | Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s = 1; g_z^s ≥ 0;
  (u/K)(1 − β) ≤ Σ_{i=1}^{u} I[(Σ_j I[y_{zi}^{sj} = 1]) + 1 = k] ≤ (u/K)(1 + β),
  k = 1, ..., K, z = 1, ..., Z, s = 1, ..., S},   (4)

where the importance of each source label vector y_z^s is weighted by g_z^s and, without loss of generality, the extended binary class labels of X_u are denoted by y = [y_1^1, ..., y_1^{K−1}, ..., y_u^1, ..., y_u^{K−1}]^⊤. In addition, M forms the convex hull of the target output label space [42].
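As a concrete illustration of Algorithm 1, the following is a minimal sketch of the label-vector generation step for a single source classifier, assuming its decision values on the target unlabeled data have already been computed. The function and variable names are illustrative only and do not come from the authors' implementation; the enumeration of class-size tuples and the sorting direction are kept naive for clarity.

```python
import itertools
import numpy as np

def generate_label_vectors(decision_values, K, beta):
    """Sketch of Algorithm 1 for one source classifier.

    decision_values : (u,) array of f_s(x_i) on the target unlabeled data.
    K               : number of ordinal classes.
    beta            : slack on the balanced class sizes.
    Returns a list of (u, K-1) arrays with entries in {-1, +1}
    (the extended label vectors y_z^s).
    """
    u = len(decision_values)
    order = np.argsort(decision_values)        # smallest scores -> lowest ordinal class
    lo = int(np.floor(u / K * (1 - beta)))
    hi = int(np.ceil(u / K * (1 + beta)))

    vectors = []
    # enumerate class-size tuples (q_1, ..., q_K) satisfying the balance constraint
    for sizes in itertools.product(range(lo, hi + 1), repeat=K - 1):
        q_last = u - sum(sizes)
        if not (lo <= q_last <= hi):
            continue
        q = list(sizes) + [q_last]

        labels = np.empty(u, dtype=int)
        start = 0
        for c, qc in enumerate(q, start=1):    # assign classes 1..K by sorted rank
            labels[order[start:start + qc]] = c
            start += qc

        # extended binary encoding of (1): y_i^k = +1 if y_i > k else -1, k = 1..K-1
        ext = np.array([[1 if y > k else -1 for k in range(1, K)] for y in labels])
        vectors.append(ext)
    return vectors
```

For m source domains, the same routine would be applied to each of the S = 2^m − 1 precomputed classifiers, and the resulting label vectors pooled to span the target label space M.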

Algorithm 1 Generation of the target label space
1: Inputs: F (a set of (precomputed) source classifiers trained from each unique combination of source domains); β, which controls the imbalance of the label vectors
2: Output: Y (a set of generated label vectors for the target unlabeled data)
3: for all f_s ∈ F do
4:   indexes = sort(f_s(x_1), ..., f_s(x_u))
5:   z = 1; q_0 = 0
6:   for each unique set {q_1, ..., q_K} with (u/K)(1 − β) ≤ q_k ≤ (u/K)(1 + β) and Σ_{k=1}^{K} q_k = u, k = 1, ..., K do
7:     create a vector y_z^s ∈ R^{u(K−1)}
8:     for C = 1, ..., K do
9:       assign the entries of y_z^s from the (Σ_{k=0}^{C−1} q_k + 1)-th index to the (Σ_{k=1}^{C} q_k)-th index as the extended class label C
10:     end for
11:     Y = Y ∪ {y_z^s}; z = z + 1
12:   end for
13: end for
14: return Y

B. Proposed Formulation

To alleviate the source sample selection bias, we propose the minimization of the expected risk by taking only the target unlabeled samples into consideration. Particularly, in TOLL, learning the labels of the unlabeled samples is conducted by minimizing the following structural risk using the hinge loss function of the SVM:

min_{ŷ∈M} { min_{w,θ,ρ,ξ} (1/2)(‖w‖² + ‖θ‖²) − ρ + C Σ_{k=1}^{K−1} Σ_{i=1}^{u} ξ_i^k
  s.t. ŷ_i^k (w^⊤ϕ(x_i) − θ_k) ≥ ρ − ξ_i^k, ξ_i^k ≥ 0, i = 1, ..., u, k = 1, ..., K − 1,
       θ_k ≤ θ_{k+1}, k = 1, ..., K − 2 },   (5)

where ϕ(x) maps x into a high-dimensional space, ŷ_i^k ∈ {+1, −1}, w^⊤ϕ(x) is the predictive function, ρ is the maximum error allowable before the slack variable ξ_i^k is penalized, and C denotes the regularization parameter that trades off between model complexity and empirical risk. Since the hinge loss employed in the inner minimization (i.e., enclosed by {·} in (5)) is non-increasing, the ordered constraints θ_1 ≤ θ_2 ≤ ... ≤ θ_{K−1} are implicitly fulfilled (see the proof of Theorem 2 in [38]). With the outer minimization of (5) over ŷ, the optimal decision function w^⊤ϕ(x) is essentially the solution whose decision boundaries lie in the low-density regions of the target unlabeled data [36]. Furthermore, TOLL learns the weight of each label vector y_z^s (as predicted by a source classifier) in (5) by minimizing the structural risk involving the target samples only. In this manner, the kernel expansion of the target classifier will only be defined by data samples in the target domain. Note that in the event that some target labeled data do exist, such information can easily be incorporated into TOLL by simply imposing the labels of the target labeled data on the available y_z^s.

C. Optimization in TOLL

In what follows, the detailed steps to solve (5) in TOLL are presented. First, the Lagrangian of the inner minimization in (5), enclosed by {·}, can be written as follows:

L = (1/2)(‖w‖² + ‖θ‖²) − ρ + C Σ_{i=1}^{u} Σ_{k=1}^{K−1} ξ_i^k
    − Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k (ŷ_i^k (w^⊤ϕ(x_i) − θ_k) − ρ + ξ_i^k) − Σ_{i=1}^{u} Σ_{k=1}^{K−1} λ_i^k ξ_i^k,   (6)

where α_i^k ≥ 0 and λ_i^k ≥ 0 are the Lagrangian multipliers of the inequality constraints. According to the KKT conditions, we have:

w = Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k ŷ_i^k ϕ(x_i),   (7)
θ_k = −Σ_{i=1}^{u} α_i^k ŷ_i^k,   (8)
C = α_i^k + λ_i^k,   (9)
Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k = 1.   (10)

Substituting (7), (8), (9) and (10) back into (6), we have

max_α −(1/2) Σ_{i,j=1}^{u} Σ_{k,k'=1}^{K−1} α_i^k α_j^{k'} ŷ_i^k ŷ_j^{k'} K(x_i^k, x_j^{k'}),

where K(x_i^k, x_j^{k'}) = ϕ(x_i)^⊤ϕ(x_j) + I[k = k'].
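For illustration, the extended kernel above has a simple block structure: it is the base kernel ϕ(x_i)^⊤ϕ(x_j) repeated over the K − 1 thresholds, plus 1 whenever the two extended samples share the same threshold index. The sketch below is a direct transcription of that formula (not the authors' code), with illustrative names only.

```python
import numpy as np

def extended_kernel(K_base, num_classes):
    """Kernel on the extended samples x_i^k = (x_i, e_k):
    K((x_i, e_k), (x_j, e_k')) = K_base[i, j] + I[k == k'].

    K_base      : (u, u) base kernel matrix on the target samples.
    num_classes : K; the extension uses thresholds k = 1, ..., K-1.
    Returns a ((K-1)*u, (K-1)*u) matrix ordered sample-major,
    i.e. row index = i*(K-1) + (k-1).
    """
    u = K_base.shape[0]
    km1 = num_classes - 1
    # kron(K_base, ones) repeats the base kernel over all threshold pairs;
    # kron(ones, eye) adds the I[k == k'] term for matching thresholds.
    return np.kron(K_base, np.ones((km1, km1))) + np.kron(np.ones((u, u)), np.eye(km1))
```

This ((K − 1)u) × ((K − 1)u) matrix plays the role of K in the matrix form of the dual that follows.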
We further define α = [α_1^1, ..., α_1^{K−1}, ..., α_u^1, ..., α_u^{K−1}]^⊤ and A = {α | Σ_{i=1}^{u} Σ_{k=1}^{K−1} α_i^k = 1, 0 ≤ α_i^k ≤ C, i = 1, ..., u, k = 1, ..., K − 1}; then (5) is simplified as follows:

min_{ŷ∈M} max_{α∈A} −(1/2) α^⊤ (K ∘ ŷŷ^⊤) α.   (11)

Since A and M are both compact sets, according to the minimax theorem [43], swapping the order of the min and max in (11) is equivalent to:

max_{α∈A} min_{ŷ∈M} −(1/2) α^⊤ (K ∘ ŷŷ^⊤) α.   (12)

In addition, (12) can be reformulated as:

max_{α∈A} { max_Ψ −Ψ  s.t.  Ψ ≥ (1/2) α^⊤ (K ∘ y_t y_t^⊤) α, ∀ y_t ∈ M }.   (13)

Moreover, the dual form of the inner maximization of (13) is:

max_{α∈A} min_{d∈D} −(1/2) α^⊤ ( Σ_{t: y_t∈M} d_t K ∘ y_t y_t^⊤ ) α,   (14)

where d denotes a vector of Lagrangian multipliers d_t and D = {d | Σ_{t: y_t∈M} d_t = 1, d_t ≥ 0 ∀t : y_t ∈ M} is the domain of d.
Algorithm 2 Transfer Ordinal Label Learning (TOLL)
1: Inputs: M_2 (the set of source label vectors generated by Algorithm 1)
2: Initialize α uniformly over its (K − 1)u entries, find the most violated y_t via (16) and let S = {y_t}
3: repeat
4:   Find the optimal d_S and α in (15) via MKL
5:   Find the most violated y_t via (16) and set S = S ∪ {y_t}
6: until convergence
7: return d_t, y_t ∀t : y_t ∈ S

Since D and A are both compact sets, swapping the order of the max and min in (14) is equivalent to:

min_{d∈D} max_{α∈A} −(1/2) α^⊤ ( Σ_{t: y_t∈M} d_t K ∘ y_t y_t^⊤ ) α.   (15)

Note that the set M in (15) corresponds to the base kernels of a Multiple Kernel Learning (MKL) problem [44]. Hence, (15) can be solved using efficient MKL solvers [45]. In the presence of a significant number of source classifiers, solving (15) directly by MKL may not be efficient. Fortunately, as it is unlikely for all of the constraints in (13) to be active simultaneously at the optimal solution, the cutting plane method can be efficiently deployed [46] to solve (15) (see Algorithm 2). The algorithm begins with a uniform initialization of α and then locates the most violated constraint of (13) via (16).

Theorem 1. The most violated constraint of (13) for a fixed α is given by:

arg max_{y∈M_2} (1/2) α^⊤ (K ∘ yy^⊤) α,   (16)

where M_2 = {y_1^1, ..., y_Z^1, ..., y_1^S, ..., y_Z^S}.

Proof: Let f(y) = (1/2) α^⊤ (K ∘ yy^⊤) α. Since f(·) is a convex function, f((1 − λ)y_i + λy_j) ≤ (1 − λ)f(y_i) + λf(y_j), ∀y_i, y_j ∈ M_2, λ ∈ [0, 1], according to the convexity property. If f(y_i) > f(y_j), then f((1 − λ)y_i + λy_j) ≤ f(y_i). Similarly, if f(y_i) < f(y_j), then f((1 − λ)y_i + λy_j) ≤ f(y_j). Therefore, f((1 − λ)y_i + λy_j) ≤ max(f(y_i), f(y_j)) holds. By induction [42], f(λ_1^1 y_1^1 + ... + λ_Z^1 y_Z^1 + ... + λ_1^S y_1^S + ... + λ_Z^S y_Z^S) ≤ max_{y∈M_2} f(y), given Σ_{s=1}^{S} Σ_{z=1}^{Z} λ_z^s = 1 and λ_z^s ∈ [0, 1].

Note that to solve (16), no numerical optimization solver is needed, since the maximum objective value is simply obtained by computing all the objective values over the set M_2; the most violated y_t corresponds to the one with the highest value among those computed. Hence, the first active constraint is chosen based on the most violated y_t. Thereafter, the current set of selected constraints is solved via MKL before obtaining the next most violated constraint for inclusion into the set of constraints. The process of finding the next most violated constraint is repeated until convergence. Empirically, only a few iterations are needed for Algorithm 2 to converge. The overall time complexity of TOLL is O(TJ((K − 1)u)^{2.3}), where J and T are the numbers of iterations incurred by the cutting plane method and MKL, respectively, and O(((K − 1)u)^{2.3}) denotes the empirical complexity of SVM training. From our experience in running the experiments, J is generally less than a dozen and T is usually small as it depends on J.

Upon convergence, the labels of X_u can be derived as follows. For a K-class problem with K > 2, by replacing f(x^k) in (3) with sign(Σ_{t: y_t∈S} d_t y_t^k), the class label of x becomes (Σ_{k=1}^{K−1} I[sign(Σ_{t: y_t∈S} d_t y_t^k) = 1]) + 1. This type of labeling is based on weighted voting, in which each vote carries a learned weight d_t. In addition, for a binary problem (i.e., K = 2), the labels of the target domain can be recovered using singular value decomposition on Y = Σ_{t: y_t∈S} d_t y_t y_t^⊤ as D_1 V_1 [29, 31], where D_1 and V_1 are the largest eigenvalue and the corresponding eigenvector, respectively. Then, the polarity of the groups learned by V_1 can be determined by a majority vote among the source classifiers.
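The two computational ingredients just described, selecting the most violated label vector in (16) by direct enumeration over M_2 and recovering class labels by weighted voting once the weights d_t are learned, can be sketched as follows. This is only an illustration under the notation above, with hypothetical function names.

```python
import numpy as np

def most_violated(alpha, K_ext, candidates):
    """Pick the y in M_2 maximizing (1/2) * alpha^T (K_ext o y y^T) alpha, as in (16).

    alpha      : ((K-1)*u,) vector of dual variables.
    K_ext      : ((K-1)*u, (K-1)*u) extended kernel matrix.
    candidates : list of ((K-1)*u,) extended label vectors (the set M_2).
    """
    # alpha^T (K o y y^T) alpha == (alpha * y)^T K (alpha * y)
    scores = [0.5 * (alpha * y) @ K_ext @ (alpha * y) for y in candidates]
    return candidates[int(np.argmax(scores))]

def predict_labels(d, selected):
    """Weighted-voting rule for K > 2:
    class(x_i) = 1 + sum_k I[ sign( sum_t d_t * y_t[i, k] ) = +1 ].

    d        : (T,) learned weights of the selected label vectors.
    selected : (T, u, K-1) array of the extended label vectors in S.
    """
    votes = np.tensordot(d, selected, axes=1)      # (u, K-1): weighted vote per threshold
    return 1 + np.sum(np.sign(votes) > 0, axis=1)  # count thresholds voted +1
```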
D. Generalization Error Bound of TOLL

In this subsection, we analyze the generalization absolute error bound of the proposed TOLL in the target domain. First, we define the joint distributions of the sth source domain and the target domain as P^s and P^t, respectively. Similarly, the marginal distributions of the sth source domain and the target domain are denoted by D^s and D^t, respectively. The expected errors of the sth source domain and the target domain are then given by

ε^s(h) = E_{(x,y)∼P^s} I[sign(h(x)) ≠ y]  and  ε^t(h) = E_{(x,y)∼P^t} I[sign(h(x)) ≠ y],

respectively. Note that I[sign(h(x)) ≠ y] is the zero-one loss function. Similarly, the expected errors for the kth extended class of the sth source domain and the target domain are given by

ε_k^s(h) = E_{(x,y)∼P^s} I[sign(h(x^k)) ≠ y^k]  and  ε_k^t(h) = E_{(x,y)∼P^t} I[sign(h(x^k)) ≠ y^k],

respectively. In addition, given two hypotheses h_1 and h_2, we define ε^t(h_1, h_2) = E_{x∼D^t} I[sign(h_1(x)) ≠ sign(h_2(x))]. In what follows, we first derive the generalization absolute error bound for a target hypothesis of ordinal regression in Theorem 2 and Theorem 3. After that, the generalization absolute error bound on the target data for TOLL will be derived in Theorem 4.

Theorem 2. A hypothesis h of ordinal regression has the following generalization absolute error bound in the target domain:

Σ_{k=1}^{K−1} ε_k^t(h) ≤ Σ_{k=1}^{K−1} (ε_k^s(h) + d_k^s(h) + λ_k^s),   (17)

where λ_k^s = min_{h'∈H} ε_k^s(h') + ε_k^t(h') and d_k^s(h) = |ε_k^t(h, h') − ε_k^s(h, h')|, and |·| denotes the absolute value operator.

Proof: From [47], a hypothesis h has the following generalization error bound in the target domain:

ε^t(h) ≤ ε^s(h) + d^s(h) + λ^s,   (18)

where λ^s = min_{h'∈H} ε^s(h') + ε^t(h') and d^s(h) = |ε^t(h, h') − ε^s(h, h')|. Using the extended binary classification model, the generalization error bound of the hypothesis h on the kth extended class is:

ε_k^t(h) ≤ ε_k^s(h) + d_k^s(h) + λ_k^s.   (19)

Combining the error bounds over all ordinal labels, the proof is completed.

Theorem 3. For a margin Λ > 0, with probability at least 1 − δ, a hypothesis h of ordinal regression has the following generalization absolute error bound in the target domain:

Σ_{k=1}^{K−1} ε_k^t(h) ≤ Σ_{k=1}^{K−1} (ε̂_k^s(h) + λ_k^s + d_k^s(h)),   (20)

where the empirical term ε̂_k^s(h) = (1/n_s) Σ_{i=1}^{n_s} I[y_i^{sk} h(x_i^{sk}) ≤ Λ] + Γ^s, and the confidence term Γ^s = O(·) vanishes with the source sample size n_s and depends on R/Λ, log n_s and δ, with K(x, x) + 1 ≤ R², ‖w‖² + ‖θ‖² ≤ 1, and h(x^k) = w^⊤x − θ_k.

Proof: From Theorem 6 of [48], a hypothesis h of ordinal regression has the following source generalization absolute error bound:

Σ_{k=1}^{K−1} ε_k^s(h) ≤ Σ_{k=1}^{K−1} ε̂_k^s(h).   (21)

Next, by substituting (21) into (17), the proof is obtained.

Theorem 4. A hypothesis h of ordinal regression in the proposed framework, TOLL, has the following generalization absolute error bound in the target domain:

Σ_{k=1}^{K−1} ε_k^t(h) ≤ Σ_{s=1}^{S} Σ_{z=1}^{Z} Σ_{k=1}^{K−1} g_z^s (ε̂_k^s(h) + λ_k^s + d_k^s(h)),   (22)

with Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s = 1.

Proof: Since TOLL imposes the inequality constraint g_z^s ≥ 0, the following holds for any g_z^s applied to (20):

g_z^s Σ_{k=1}^{K−1} ε_k^t(h) ≤ g_z^s Σ_{k=1}^{K−1} (ε̂_k^s(h) + λ_k^s + d_k^s(h)).   (23)

Then, summing (23) over all s and z with Σ_{s=1}^{S} Σ_{z=1}^{Z} g_z^s = 1, the proof is completed.

Using the generalization bound derived in (22), we proceed to discuss the solution obtained by the strategy in TOLL. With Algorithm 1, TOLL trains a classifier that minimizes the structural risk for each source domain, and then attains numerous hypotheses from the multiple relevant source classifiers by projecting their bias parameters onto the target unlabeled data. Next, the weight g_z^s is obtained for each hypothesis via Algorithm 2. As the hypotheses are obtained from the source domains, it is reasonable for ε̂_k^s(h) to be small. (If the empirical risk of a source domain is high, that source domain can be removed from being considered to form the hypotheses of TOLL; for simplicity, we assume the empirical risks of all source domains are acceptable, so no removal is needed.) Furthermore, although λ_k^s is unknown, to be consistent with previously reported DA works we shall assume λ_k^s to be small. Since both ε̂_k^s(h) and λ_k^s in (22) are small, the remaining term to minimize reduces to Σ_{s=1}^{S} Σ_{z=1}^{Z} Σ_{k=1}^{K−1} g_z^s d_k^s(h), where d_k^s(h) = |ε_k^t(h, h') − ε_k^s(h, h')|. In what follows, we present the details of optimizing this term. In particular, there are two cases to analyze for d_k^s(h), namely, ε_k^t(h, h') ≤ ε_k^s(h, h') and ε_k^t(h, h') ≥ ε_k^s(h, h').

Remark 1. When ε_k^t(h, h') ≤ ε_k^s(h, h'), we have d_k^s(h) = ε_k^s(h, h') − ε_k^t(h, h'). Note that since ε_k^s(h, h') ≤ ε̂_k^s(h) + ε_k^s(h') (by the triangle inequality), in which ε_k^s(h') is part of λ_k^s that is assumed to be reasonably small, and ε̂_k^s(h) (defined in (20)) can be estimated and chosen to be small, the bound for d_k^s(h) should also be reasonably small. Recall that minimizing (5) over ŷ ∈ M is equivalent to choosing a label vector ŷ that enforces a decision boundary lying in the lower-density regions of the target unlabeled data.
It is thus expected that ε_k^t(h, h') will be small, according to the cluster assumption [35, 36].

Remark 2. When ε_k^t(h, h') ≥ ε_k^s(h, h'), we have d_k^s(h) = ε_k^t(h, h') − ε_k^s(h, h'). Hence, minimizing ε_k^t(h, h') leads to the minimization of d_k^s(h) as well.

In summary, the ensemble strategy proposed in TOLL alleviates the risk of choosing a poor source hypothesis.

IV. EXPERIMENTAL STUDY

In this section, Subsections IV-A, IV-B, IV-C and IV-D describe, respectively, the settings of the class ratios of the source and target domains, the datasets (Sentiment, Newsgroup and Email) used for the evaluations, the state-of-the-art algorithms considered in the study, and the evaluation metric used to measure performance.

A. Setup of the Class Ratios of the Source and Target Domains

In practice, the true class distribution of the target domain is usually unknown. Thus, we begin with an investigation of the effects of various class ratios of the target data on the prediction accuracies. To carry out the investigation, the term Target Positive Class Ratio (TPCR) is introduced for the purpose of analyzing the impact of various class ratios in the target domain on the diverse learning algorithms considered. For a binary problem (K = 2), the TPCR defines the proportion of positive samples in the target domain. For example, in a set of 1000 target samples, a TPCR of 0.3 implies that 300 samples are positive and the remaining are negative. In the experimental study, three TPCR settings are investigated. In the case of the K = 4 ordinal regression problem, the samples with labels belonging to the first half of the K classes are treated as positive and the rest of the samples are treated as negative. In addition, each class in its respective positive/negative group has an equal number of samples. For example, a 4-class problem with 1000 samples under the setting of TPCR = 0.3 implies that classes 1 and 2 have 150 samples each, while classes 3 and 4 have 350 samples each.
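The class-count arithmetic in the example above can be restated as a small helper; the function name and the equal-split convention used here are illustrative assumptions, not part of the authors' protocol.

```python
def class_counts(num_samples, num_classes, tpcr):
    """Split samples so that the first half of the ordinal classes (the 'positive'
    group) holds a fraction `tpcr` of the data, with equal counts inside each group.

    Example: class_counts(1000, 4, 0.3) -> [150, 150, 350, 350]
             class_counts(1000, 2, 0.3) -> [300, 700]
    """
    half = num_classes // 2
    pos_total = int(round(num_samples * tpcr))
    neg_total = num_samples - pos_total
    return ([pos_total // half] * half
            + [neg_total // (num_classes - half)] * (num_classes - half))
```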

Besides investigating the various class distributions of the target domain, we also study the various class ratios of the source domains, since source sample selection bias is likely to be observed when the trained classifier exhibits properties that steer towards the distribution of the source domain. Specifically, an imbalanced class ratio between the source and target domains is expected to aggravate the degree of source sample selection bias [49, 50]. Hence, in our study, the term Source Positive Class Ratio (SPCR) is introduced and defined to denote the proportion of positive samples in the source domain. In the experimental study, the robustness of each state-of-the-art algorithm is investigated over several SPCR settings, covering the different class ratios between the source and target domains.

B. Multi-Domain Sentiment, Newsgroup and Email Datasets

On the Sentiment dataset, we consider the cases of K = 2 and K = 4. The dataset was prepared as reported in [14]. It comprises four categories of product reviews: Book, DVDs, Electronics, and Kitchen appliances from Amazon.com. For each task, one category is posed as the target domain while the rest serve as related source domains. Each review is marked with a five-star rating scale, where a higher star rating implies better feedback. Note that in [14], the 3-star rating data have been removed to avoid ambiguities in the binary classification. In the context of the binary (K = 2) problem, the negative samples are made up of 1-star and 2-star ratings, whereas the rest of the ratings form the positive samples. Hence, the task is to categorize the target testing data into positive and negative reviews. In the context of the K = 4 problem, the task is to categorize the target testing data into star-ratings 1, 2, 4 and 5. In each of the tasks for both the K = 2 and K = 4 problems, 2000 samples are randomly selected from each source domain to form the labeled data, and 500 samples from the target domain serve as unlabeled data.

TABLE I: Grouping of source and target domains in the Newsgroup dataset
Domain   | Category comp       | Category rec   | Category sci
Source 1 | windows.x           | motorcycles    | electronics
Source 2 | sys.ibm.pc.hardware | sport.baseball | med
Source 3 | sys.mac.hardware    | sport.hockey   | space
Target   | graphics            | autos          | crypt

On the Newsgroup and Email datasets, K = 2 is considered. The Newsgroup dataset consists of three main categories: comp, rec, and sci. Each main category is then separated into Source 1, Source 2, Source 3 and Target (see Table I), resulting in three tasks: comp vs. rec, comp vs. sci and rec vs. sci. In particular, each task is to categorize the target testing data into their respective categories. The Email dataset considered here is from the ECML/PKDD 2006 discovery challenge. The source and target domains consist of spam and non-spam emails from user and public inboxes, respectively. The task is then defined as categorizing the target testing data into spam and non-spam emails. In each of the tasks, 1000 samples are randomly selected from each source domain to form the labeled data, while 500 samples from the target domain serve as unlabeled data. Since the problems of interest are text datasets, they are preprocessed with single and bi-gram terms extracted, stopwords removed, and stemming and normalization of each feature performed. Consequently, each feature of a sample is represented by its respective tf-idf value. Further, the linear kernel is employed in the experimental study.
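The text-preprocessing pipeline described above can be approximated with standard tooling; the snippet below is only a rough equivalent using scikit-learn (the authors' exact preprocessing, including the stemmer and normalization scheme, is not specified here), so treat the parameter choices as assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Unigram and bigram terms, English stopword removal, tf-idf weighting.
# Note: stemming is not built into TfidfVectorizer; a stemmer (e.g. from NLTK)
# would have to be applied to the documents beforehand.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")

docs = ["great product, works well", "terrible build quality"]  # toy examples
X = vectorizer.fit_transform(docs)        # sparse tf-idf features
clf = LinearSVC().fit(X, [1, 0])          # linear kernel via a linear SVM
```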
C. State-of-the-art Algorithms Considered

In the present study, several state-of-the-art algorithms are investigated under the diverse TPCR and SPCR settings considered, on datasets involving three source domains (Sentiment and Newsgroup datasets) or one source domain (Email dataset) and a target domain:

1) 1S-SVM: Each source domain is trained using the SVM (the ordinal SVM code used is publicly available; see htlin/program/libsvm/#ordinal), and the lowest balanced absolute error among the classifiers is reported.

2) 2S-SVM: Each unique pair of source domains is trained using the SVM, and the lowest balanced absolute error among the classifiers is reported.

3) MCC: Multiple Convex Combination denotes a representative DA method that linearly combines all source classifiers trained based on the SVM [20]. Since the present study involves three source domains, MCC is equivalent to a 3S-SVM.

4) LG-MMC: Label Generating Maximum Margin Clustering [31] maximizes the margin separating two opposite clusters of the target unlabeled data without the use of any label information available in the source domains. Since LG-MMC does not use any class label information, we assume the class labels assigned to the respective clusters to be those true class labels that give the lowest balanced absolute error. Since LG-MMC does not consider the ordinal constraint, it is only used on binary problems (i.e., K = 2).

5) KMM: Kernel Mean Matching addresses the marginal distribution differences between a single source domain and a target domain by re-weighting each of the source samples in the Reproducing Kernel Hilbert Space (RKHS) such that the Maximum Mean Discrepancy (MMD) criterion defined on the source and target domains [7] is minimized (the weights of the source samples are learned using quadratic programming, as stated in [7]). A weighted SVM is then trained on the source domain using the derived weight of each sample. One KMM is trained for each source domain and the lowest balanced absolute error among the classifiers is reported.

6) TCA: Transfer Component Analysis assumes there exists some feature map under which the predictive distributions of a single source domain and a target domain are similar, i.e., P^S(y|x) ≈ P^T(y|x), where the superscripts S and T refer to the source and target domains, respectively.
Hence, TCA learns a set of transfer components in the RKHS based on the MMD criterion, and subsequently the SVM is trained on the source domain in this RKHS [18]. One TCA is trained for each source domain and the lowest balanced absolute error among the classifiers is reported.

7) TOLL: Transfer Ordinal Label Learning learns the labels of the target unlabeled data by maximizing the margin of separation in the target data, based on the label space spanned by a linear combination of source classifiers, as described in Figure 1.

The parameters of all methods are configured by means of the k-fold cross-source-domain validation scheme suggested in [51], which denotes an extension of the standard k-fold cross validation for DA learning. Here, k is the number of source domains, i.e., k = m. Specifically, each partition represents a source domain in k-fold cross-source-domain validation. In addition, β is fixed to a common value in LG-MMC and TOLL (as used in Algorithm 1).

D. Evaluated Performance Metric

For ordinal problems, the absolute error is commonly used as the criterion for defining accuracy; it gives the absolute difference between the predicted label and the ground-truth label. In particular, the smaller the absolute error, the nearer the predicted labels are to the ground-truth labels. However, in cases where the source and target class distributions differ, the balanced error can be considered [52, 53]. Taking this cue, the balanced absolute error is considered as the evaluation criterion for the ordinal regression problem studied here, and is defined as follows:

(1/K) Σ_{k=1}^{K} ( Σ_{i=1}^{u} |ŷ_i − y_i| I[y_i = k] / Σ_{i=1}^{u} I[y_i = k] ),   (24)

where |·| denotes the absolute value. For each of the tasks, 10 independent runs are conducted and the average results are reported.

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, we first perform a study on the validity of the cluster assumption used in TOLL, before proceeding with the discussion and analysis of the experimental results for the algorithms considered. Lastly, an experimental study is carried out to investigate the time complexities of the compared state-of-the-art methods.

A. Case Study on the Cluster Assumption in TOLL

In this subsection, we analyze the validity of the cluster assumption criterion employed in TOLL. Recall that TOLL begins with a computation of the source classifiers for generating the set of potential class labels y_z^s for the target unlabeled data, as outlined in Algorithm 1. We plot the margin of separation 2/‖w‖ for each y_z^s, obtained from solving the inner minimization of (5), with Kitchen Appliances serving as the target domain. Subfigures 2(a), 2(b) and 2(c) depict the plots of y_z^s generated by 1S-SVM as the source classifiers trained on the source domains Book, DVDs, and Electronics, respectively. Subfigures 2(d), 2(e) and 2(f) are plots of y_z^s generated by 2S-SVM as source classifiers from two combined source domains: Book and DVDs, Book and Electronics, and DVDs and Electronics, respectively. Then, Subfigure 2(g) depicts the plot of y_z^s generated by 3S-SVM (MCC) as the source classifier on all the source domains. In particular, Figure 2 shows the plots of y_z^s for a particular run using Kitchen Appliances as the target domain with K = 4 under fixed TPCR and SPCR settings. The line in each of subfigures 2(a) to 2(g) regresses the linear trend of the plots of y_z^s.
From the slope of the lines, it is indicative that an increasing margin (2/‖w‖) is associated with a general decrease in the balanced absolute error. These results imply that an appropriate choice of source classifiers based on the large target margin criterion can minimize the balanced absolute error in the target domain. Further, the cluster assumption made in TOLL to minimize the target generalization absolute error (Theorem 4) by means of maximizing the target margin is valid. In addition, although the plots of y_z^s in all the subfigures of Figure 2 share a similar range of margin values, their balanced absolute errors can be observed to differ considerably. This highlights that when no a priori knowledge on choosing the most suitable source domain is available, a source-domain ensemble strategy, as proposed in TOLL, is important for robust prediction accuracy.

B. Experimental Results and Discussion on the Sentiment Dataset

Figures 3 and 4 summarize the balanced absolute error of the target unlabeled data obtained on the Sentiment prediction dataset for K = 2 and K = 4, respectively. The three subfigures on the left denote the results obtained on the target domain for one TPCR setting, whereas the remaining three subfigures on the right present the results for the other TPCR setting considered. Subfigures 3(a), 3(d), 4(a) and 4(d) summarize the balanced absolute error of the target domain on the DVDs dataset for varying degrees of SPCR in the source domain. On the other hand, subfigures 3(b,e) and 4(b,e), and subfigures 3(c,f) and 4(c,f), display the results for the cases where the Electronics and Kitchen Appliances datasets serve as the target domain of interest, respectively. For the sake of conciseness, the experimental results with the Book dataset as the target domain are omitted from this paper, since trends similar to the other datasets studied have been observed for the considered algorithms. In addition, since the results for the complementary TPCR setting are symmetrical to those reported, the remaining target domains for that setting are also omitted.

As observed from Figure 3, LG-MMC exhibited the worst balanced absolute error among all the methods under investigation. This indicates that an unsupervised approach based on maximal margin separation of the unlabeled data, without any use of label information, is less effective than DA methods, owing to the abundance of labeled data from other related source domains that can be appropriately used to complement class predictions on the target unlabeled data. The results obtained thus confirm the effectiveness of DA methods on Sentiment data in the absence of target label information. (Note that LG-MMC does not appear in Figure 4 (K = 4) since it does not consider ordinal class labels.)
Fig. 2. Margin vs. absolute error for different source domains, with Kitchen Appliances as the target domain of interest, for K = 4 under fixed TPCR and SPCR settings. Panels (a)-(g) correspond to the source combinations Src B, Src D, Src E, Src B+D, Src B+E, Src D+E and Src B+D+E, where B, D and E symbolize Book, DVDs and Electronics, respectively. The points in each subfigure denote the class labels y_z^s obtained by Algorithm 1, and each y_z^s has a respective margin 2/‖w‖ obtained from solving the inner minimization of (5). The x-axis represents the margin, while the y-axis denotes the balanced absolute error. Please refer to the text for more details.

It can also be observed from the results in Figures 3 and 4 that 1S-SVM underperforms MCC (i.e., 3S-SVM) and 2S-SVM in general. Nevertheless, when source domains are used at equal weights (i.e., 2S-SVM and MCC), the results in Figures 3 and 4 show significant degradations, due to the imbalanced class ratios between the source and target domains. Note that when the SPCR setting approaches either extreme of the range considered in the experimental study, the performance of most methods, except TOLL, can be observed to degrade significantly. Since both KMM and TCA operate by minimizing the marginal distribution differences between the target and source domains according to the MMD criterion [7], the degradations in performance indicate that the necessary assumption of similar predictive distributions between source and target domains made by KMM and TCA does not hold on the Sentiment data. At the same time, these results also indicate that imbalanced class ratios between the source and target domains do lead to source sample selection bias.

From Figure 4, KMM and TCA are noted to have attained lower performance than the others in general. The former method operates by re-weighting the source labeled data so as to match the marginal distribution of the target data, while assuming a common class distribution shared by the source and target domains; hence poor prediction results are observed when their class distributions are dissimilar. The latter method remaps the kernel space so as to minimize the distance between the source and target domains, such that samples of star-ratings 1 and 2 are reconfigured to be closer together, and similarly for samples of star-ratings 4 and 5. This explains why the performance obtained by TCA in Figure 4 is noted to be poor on ordinal problems, while exhibiting rewarding results on the binary sentiment problem, as observed in Figure 3.

While the performances of the DA methods are observed to suffer from source sample selection bias due to the differing class ratios between the source and target domains, TOLL is observed to perform robustly across the range of SPCR and TPCR settings considered. TOLL also attained the lowest balanced absolute error, in relation to all the other methods, for the extreme SPCR configurations, as observed from all the subfigures. This implies that TOLL is capable of choosing a robust linear combination of source label vectors that represent the label space of the target unlabeled data.
It maximizes the margin of separation solely based on the target unlabeled data in the target label space that is spanned by label vectors generated from multiple independent source classifiers (i.e., the bias parameter of each source classifier is projected onto the target unlabeled data). It is also worth highlighting that while TCA outperformed all other algorithms at certain TPCR and SPCR settings on the Sentiment (K = 2) dataset (see Figures 3(a,b,c)), the reported TCA results are chosen as the best among three results, each of which is obtained by applying TCA to a different source domain. The three results on different source domains, each trained using TCA with DVDs as the target domain, are reported in Figure 5. Furthermore, the balanced absolute error of each source domain trained using the SVM, denoted as 1S-SVM, is also depicted in Figure 5. In practice, it is non-trivial to determine in advance which source domain is the most suitable for the target domain, especially in the absence of prior knowledge on the target domain. TOLL thus fills this gap by providing an ensemble of suitable source classifiers to attain improved predictive performance in the target domain of interest. Therefore, in general, the results in Figure 5 show that TCA performed much worse than TOLL across most of the SPCR settings.
Fig. 3. Balanced absolute error for K = 2 on the Sentiment dataset (target domains DVDs, Electronics and Kitchen Appliances; methods 1S-SVM, 2S-SVM, MCC, LG-MMC, KMM, TCA and TOLL), where the left section corresponds to one TPCR setting of the target domain and the right section to the other. The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

Fig. 4. Balanced absolute error for K = 4 on the Sentiment dataset (target domains DVDs, Electronics and Kitchen Appliances; methods 1S-SVM, 2S-SVM, MCC, KMM, TCA and TOLL), where the left section corresponds to one TPCR setting of the target domain and the right section to the other. The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

C. Newsgroup and Email Experimental Result Discussions

The results for the Newsgroup dataset are reported in Figure 6. We can observe that LG-MMC achieved decent performance on the Newsgroup data. Particularly, LG-MMC reported an improved balanced absolute error over 1S-SVM, 2S-SVM, MCC and KMM at several SPCR settings in most of the subfigures illustrated. This implies that solely learning from target unlabeled data can sometimes be more beneficial than enlisting additional labeled samples from other source domains, especially when the target data is well separated (cluster assumption). 1S-SVM also operates by maximizing the margin of separation, but its training is concentrated on the source domain, where source sample selection bias creeps in. KMM improves on the results of 1S-SVM by means of the Maximum Mean Discrepancy criterion but still fares poorer than LG-MMC. On the other hand, TCA and TOLL achieved significantly lower balanced absolute error than LG-MMC, 1S-SVM and KMM. Overall, TOLL emerged as superior to all other methods in all experimental settings considered, except on the rec vs. sci task at one of the TPCR settings. The details of the rec vs. sci task at this setting are depicted in Figure 7, where it is observed that TCA performs worse than TOLL if source 2 or source 3 is considered in the training process for classifying the target unlabeled data.
Fig. 5. DVDs as the target domain in the Sentiment experiments for K = 2: comparisons among 1S-SVM, TCA and TOLL. 1S-SVM-X or TCA-X denotes that source domain X is used to classify the DVDs test data.

Therefore, the selection of appropriate source domains in TCA is an essential task that bears great impact on its effectiveness. However, in practice, it is difficult to determine the most appropriate source domain for TCA beforehand. In general, the results in Figure 6 show that TOLL displayed high robustness and superior prediction accuracy throughout the entire range of SPCR and TPCR settings considered.

The results on the Email dataset are reported in Figure 8. MCC, KMM and TCA reported their best accuracies at one extreme of the SPCR range. However, the accuracies of MCC, KMM and TCA exhibited declining trends as the SPCR approached the other extreme. This observation is due to the task of detecting spam emails (positive samples) being easier than identifying non-spam emails (negative samples). On the other hand, since LG-MMC learns only from target unlabeled data, it does not suffer from source sample selection bias, as observed in the figure when far fewer negative samples than positive samples are available (i.e., at high SPCR). Nevertheless, LG-MMC is observed to exhibit poor accuracy across the entire range of SPCR settings. It is worth mentioning that TCA reported the best accuracy among all methods at one SPCR setting. Therefore, minimizing the marginal distribution differences between the target and source domains, according to the MMD criterion, through finding the transfer components in the RKHS does help. Nevertheless, the approach still suffers performance degradation when the class distributions of the source and target domains differ. On the other hand, TOLL achieved better performance than all the methods considered over a wide range of SPCR values and displayed robust results across the entire range of SPCR settings. Last but not least, we also performed a Wilcoxon signed-ranks test [54] on all the results in Figures 2, 3, 4, 6 and 8, and it indicates with 99% confidence that TOLL is significantly better than the compared methods.

Fig. 6. Newsgroup experimental results (tasks comp vs. rec, comp vs. sci and rec vs. sci; methods 1S-SVM, 2S-SVM, MCC, LG-MMC, KMM, TCA and TOLL), where the left section corresponds to one TPCR setting of the target domain and the right section to the other. The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

D. Comparison of the Time Complexities of the State-of-the-art Methods

In what follows, we discuss the theoretical time complexities of the compared methods. 1S-SVM, 2S-SVM and MCC use the SVM as the classifier; hence they exhibit a time complexity of O(((K − 1)n)^{2.3}), which is assumed as the empirical complexity of SVM training, where n is the number of source labeled data. KMM is solved using quadratic programming with a time complexity of O(n^3). TCA, on the other hand, is solved with an eigen-decomposition and has a time complexity of O((n + u)^3).
For TOLL, the computational complexity is O(TJ((K − 1)u)^{2.3}), where K is the number of classes, u is the number of target unlabeled data, and J and T are the numbers of iterations incurred by the cutting plane method and MKL, respectively. Thus, TOLL takes a factor of JT(u/n)^{2.3}, JT((K − 1)u)^{2.3}/n^3 and JT((K − 1)u)^{2.3}/(n + u)^3 over MCC, KMM and TCA, respectively. Hence, when the product of J, T, K and u is much greater than n, TOLL will display a higher computational complexity than the other methods. On the other hand, when an abundance of source data is available such that n ≫ u, the proposed TOLL is faster.
TABLE II: Training time (seconds) of various methods on the Kitchen Appliances (K = 4) dataset. Methods compared: 1S-SVM, 2S-SVM, MCC, KMM, TCA and TOLL.

To verify the theoretical analysis, an experimental study is carried out to investigate the training times of the methods with Kitchen Appliances (K = 4) as the target domain. Note that the training size of both 1S-SVM and KMM is 2000, while the training sizes of 2S-SVM, MCC, TCA and TOLL are 4000, 6000, 2500 and 500, respectively. The training times (in seconds) of the aforementioned methods are detailed in Table II. Since 1S-SVM, 2S-SVM and MCC share the same computational complexity, the method with the most training samples is expected to have the longest training time, as observed in Table II. Furthermore, as the amount of source labeled data increases, which leads to a smaller ratio of target unlabeled data to source labeled data, the ratio of the training time of TOLL to that of the other methods is also expected to become smaller. Accordingly, the observations from Table II show that the ratio of the training time of TOLL to MCC is smaller than that of TOLL to 2S-SVM and of TOLL to 1S-SVM. It is also worth mentioning that the training times of TCA, KMM and 1S-SVM are consistent with the aforementioned theoretical computational complexities of those methods. Last but not least, TOLL took the longest time to train a classifier. Nevertheless, TOLL is observed to be robust across the tasks depicted in Figures 3, 4, 6 and 8.

Fig. 7. Rec vs. sci task in the Newsgroup experiments: comparisons among 1S-SVM, TCA and TOLL. 1S-SVM-X or TCA-X denotes that source domain X (see Table I) is used to classify the target test data. The y-axis is the balanced absolute error.

Fig. 8. Email (spam) experimental results for the two TPCR settings considered (methods MCC, LG-MMC, KMM, TCA and TOLL). The x-axis is the source domain's positive class ratio (SPCR) setting and the y-axis is the balanced absolute error. Please refer to the text for more details.

VI. CONCLUSION

A core challenge of transfer learning in attaining a reliable classifier from relevant source domains is the induction of source sample selection bias, such that the eventual trained classifier often steers towards the distribution of the source domain. This bias is deemed to become more severe on data involving multiple classes. Taking this cue, we have proposed a Transfer Ordinal Label Learning (TOLL) paradigm that predicts the ordinal labels of target unlabeled data by spanning the feasible solution space with ordinal classifiers from multiple relevant source domains. In contrast to previous works, the maximum margins between consecutive ordinal classes are employed as the criterion for the selection and/or fusion of appropriate source ordinal classifiers when designing the target classifier. In this manner, the proposed approach learns a target ordinal classifier that involves only the kernel expansion of the target data. Through comprehensive experimental studies, TOLL is shown to display superiority and robustness across the entire range of imbalanced source and target class ratio settings when pitted against several state-of-the-art methods, which suffered significantly in prediction accuracy. Last but not least, TOLL is significantly better than all the compared methods over all the datasets considered in the experimental study, based on the Wilcoxon signed-ranks test [54] with 99% confidence.
REFERENCES

[1] P. Wu and T. G. Dietterich, Improving SVM Accuracy by Training on Auxiliary Data Sources, in ICML, Banff, Alberta, Canada, 2004.
[2] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, J. Statist. Plann. Inference, vol. 90, no. 2.
[3] M. Sugiyama and K.-R. Müller, Input-dependent estimation of generalization error under covariate shift, Statist. & Decis., vol. 23, no. 4.
[4] A. J. Storkey and M. Sugiyama, Mixture regression for covariate shift, in NIPS, British Columbia, Canada, 2006.
[5] S. Bickel, M. Brückner, and T. Scheffer, Discriminative Learning Under Covariate Shift, JMLR, vol. 10.
[6] X. Liao, Y. Xue, and L. Carin, Logistic regression with an auxiliary data source, in ICML, Bonn, Germany, 2005.
[7] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, Correcting Sample Selection Bias by Unlabeled Data, in NIPS, Vancouver, British Columbia, Canada, 2006.
[8] S. Bickel, M. Brückner, and T. Scheffer, Discriminative learning for differing training and test distributions, in ICML, Corvallis, Oregon, USA, 2007.
[9] M. Sugiyama, M. Krauledat, and K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, JMLR, vol. 8.
[10] J. Jiang and C. Zhai, Instance weighting for Domain Adaptation in NLP, in ACL, Prague, Czech Republic, 2007.
[11] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, A kernel method for the two-sample-problem, in NIPS, Vancouver, B.C., Canada, 2007.
[12] M. Sugiyama, T. Suzuki, and T. Kanamori, Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation, Ann. Inst. Statist. Math., 2011.
[13] J. Blitzer, R. McDonald, and F. Pereira, Domain Adaptation with Structural Correspondence Learning, in EMNLP, Sydney, Australia, 2006.
[14] J. Blitzer, M. Dredze, and F. Pereira, Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, in ACL, Prague, Czech Republic, 2007.
[15] H. Daumé III, Frustratingly easy domain adaptation, in ACL, Prague, Czech Republic, 2007.

[16] W. Dai, O. Jin, G.-R. Xue, Q. Yang, and Y. Yu, Eigentransfer: a unified framework for transfer learning, in ICML, Montreal, Quebec, Canada, 2009.
[17] E. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, D. Turaga, and O. Verscheure, Cross domain distribution adaptation via kernel mapping, in KDD, Paris, France, 2009.
[18] S. J. Pan, I. Tsang, J. Kwok, and Q. Yang, Domain Adaptation via Transfer Component Analysis, TNN, vol. 22, no. 2, 2011.
[19] S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen, Cross-domain sentiment classification via spectral feature alignment, in WWW, Raleigh, North Carolina, USA, 2010.
[20] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch, An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis, in NIPS, Vancouver, British Columbia, Canada, 2009.
[21] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, On combining classifiers, TPAMI, vol. 20, no. 3, 1998.
[22] X. Shi, Q. Liu, W. Fan, Q. Yang, and P. S. Yu, Predictive Modeling with Heterogeneous Sources, in SDM, Columbus, Ohio, USA, 2010.
[23] L. Duan, D. Xu, and I. W.-H. Tsang, Domain adaptation from multiple sources: A domain-dependent regularization approach, TNNLS, vol. 23, no. 3, 2012.
[24] R. Herbrich, T. Graepel, and K. Obermayer, Support vector learning for ordinal regression, in ICANN, Edinburgh, 1999.
[25] W. Chu and S. S. Keerthi, New approaches to support vector ordinal regression, in ICML, Bonn, Germany, 2005.
[26] J. J. Heckman, Sample selection bias as a specification error, Econometrica, vol. 47, no. 1, pp. 153–161, 1979.
[27] F. Vella, Estimating models with sample selection bias: A survey, J. Hum. Res., vol. 33, no. 1, 1998.
[28] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. The MIT Press.
[29] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, Maximum margin clustering, in NIPS, Vancouver, British Columbia, Canada, 2005.
[30] W.-S. Zheng, S. Gong, and T. Xiang, Quantifying and transferring contextual information in object detection, TPAMI, vol. 34, no. 4, 2012.
[31] Y.-F. Li, I. W. Tsang, J. T. Kwok, and Z.-H. Zhou, Tighter and convex maximum margin clustering, in AISTATS, Clearwater Beach, Florida, USA, 2009.
[32] J. J. Lim, R. Salakhutdinov, and A. Torralba, Transfer learning by borrowing examples for multiclass object detection, in NIPS, Granada, Spain, 2011.
[33] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, Training hierarchical feedforward visual recognition models using transfer learning from pseudo-tasks, in ECCV, Marseille, France, 2008.
[34] A. Farhadi, D. A. Forsyth, and R. White, Transfer learning in sign language, in CVPR, Minneapolis, Minnesota, USA, 2007, pp. 1–8.
[35] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, in ICML, Bled, Slovenia, 1999.
[36] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, Transductive ordinal regression, TNNLS, vol. 23, no. 7, 2012.
[37] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, Healing sample selection bias by source classifier selection, in ICDM, Vancouver, BC, Canada, 2011.
[38] L. Li and H.-T. Lin, Ordinal regression by extended binary classification, in NIPS, Vancouver, British Columbia, Canada, 2006.
[39] J. S. Cardoso and J. F. Pinto da Costa, Learning to classify ordinal data: The data replication method, JMLR, vol. 8.
[40] P. A. Gutiérrez, M. Pérez-Ortiz, F. Fernández-Navarro, J. Sánchez-Monedero, and C. Hervás-Martínez, An experimental study of different ordinal regression methods and measures, in HAIS (2), 2012.
[41] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning.
[42] S. Boyd and L. Vandenberghe, Convex Optimization.
[43] S.-J. Kim and S. Boyd, A Minimax Theorem with Applications to Machine Learning, Signal Processing, and Finance, SIAM J. on Optimization, vol. 19, no. 3.
[44] G. Lanckriet, N. Cristianini, P. Bartlett, and L. E. Ghaoui, Learning the kernel matrix with semidefinite programming, JMLR, vol. 5.
[45] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, Simple and efficient multiple kernel learning by group lasso, in ICML, Haifa, Israel, 2010.
[46] J. E. Kelley, Jr., The cutting-plane method for solving convex programs, Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 4, 1960.
[47] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, Learning bounds for domain adaptation, in NIPS.
[48] H.-T. Lin and L. Li, Reduction from cost-sensitive ordinal ranking to weighted binary classification, Neural Computation, vol. 24, no. 5, 2012.
[49] C.-W. Seah, I. W. Tsang, Y.-S. Ong, and K.-K. Lee, Predictive Distribution Matching SVM for Multi-domain Learning, in ECML/PKDD, Barcelona, Spain, 2010.
[50] L. Bruzzone and M. Marconcini, Domain Adaptation Problems: A DASVM Classification Technique and a Circular Validation Strategy, TPAMI, vol. 32, no. 5, 2010.
[51] J. Jiang and C. Zhai, A two-stage approach to domain adaptation for statistical classifiers, in CIKM, Lisbon, Portugal, 2007.
[52] N. V. Chawla, N. Japkowicz, and A. Kotcz, Editorial: special issue on learning from imbalanced data sets, SIGKDD Explorations Newsletter, vol. 6, pp. 1–6.
[53] M. Sokolova, N. Japkowicz, and S. Szpakowicz, Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation, Artificial Intelligence, vol. 4304.
[54] J. Demšar, Statistical comparisons of classifiers over multiple data sets, JMLR, vol. 7, pp. 1–30, 2006.

Chun-Wei Seah received the B.Eng. (first-class honors) and Ph.D. degrees in computer science from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2009 and 2013, respectively. He is currently a senior member of technical staff at the Defence Science Organisation (DSO) National Laboratories, Singapore. His current research interests include transductive learning, transfer learning, rank learning and sentiment prediction. Mr. Seah is a recipient of the Nanyang President's Graduate Scholarship.

Ivor W. Tsang received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Kowloon, Hong Kong. He is currently an Assistant Professor with the School of Computer Engineering, Nanyang Technological University (NTU), Singapore, and the Deputy Director of the Center for Computational Intelligence, NTU.
Dr. Tsang received the prestigious IEEE TRANSACTIONS ON NEURAL NETWORKS Outstanding 2004 Paper Award in 2006 and the 2008 National Natural Science Award (Class II), China. His co-authored papers also received the Best Student Paper Award at the 23rd IEEE Conference on Computer Vision and Pattern Recognition in 2010, the Best Paper Award at the 23rd IEEE International Conference on Tools with Artificial Intelligence in 2011, the Best Student Paper Award from PREMIA, Singapore, in 2012, and the Best Paper Award from the IEEE Hong Kong Chapter of Signal Processing Postgraduate Forum. He was also conferred the Microsoft Fellowship.

Yew-Soon Ong received the B.S. and M.S. degrees in electrical and electronics engineering from Nanyang Technological University (NTU), Singapore, in 1998 and 1999, respectively. He completed the Ph.D. degree on artificial intelligence in complex design at the Computational Engineering and Design Center, University of Southampton, U.K. He is currently an Associate Professor and the Director of the Center for Computational Intelligence at the School of Computer Engineering, NTU. Dr. Ong is the founding Technical Editor-in-Chief of the Memetic Computing Journal, Chief Editor of the Springer book series on studies in adaptation, learning, and optimization, and an Associate Editor of the IEEE Computational Intelligence Magazine, the IEEE Transactions on Systems, Man and Cybernetics Part B, Soft Computing, Information Sciences, the International Journal of System Sciences and many others. He also chairs the IEEE Computational Intelligence Society Emergent Technology Technical Committee and has served as a Guest Editor for several journals. His research interests in computational intelligence span memetic computing, evolutionary design, machine learning, agent-based systems and cloud computing.
