arxiv: v3 [stat.ml] 9 Aug 2016


PAC-Bayesian Theorems for Domain Adaptation with Specialization to Linear Classifiers

Pascal Germain (1), Amaury Habrard (2), François Laviolette (3), Emilie Morvant (2)

(1) INRIA, SIERRA Project-Team, Paris, France, and DI, École Normale Supérieure, 75230 Paris, France
(2) Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France
(3) Département d'informatique et de génie logiciel, Université Laval, Québec, Canada

October 2018

This report is a long version of our paper entitled "A PAC-Bayesian Approach for Domain Adaptation with Specialization to Linear Classifiers", published in the proceedings of the International Conference on Machine Learning (ICML) 2013. We improved our main results, extended our experiments, and proposed an extension to multisource domain adaptation.

Abstract

In this paper, we provide two main contributions in PAC-Bayesian theory for domain adaptation, where the objective is to learn, from a source distribution, a well-performing majority vote on a different target distribution. On the one hand, we propose an improvement of the previous approach proposed by Germain et al. (2013), that relies on a novel distribution pseudodistance based on a disagreement averaging, allowing us to derive a new tighter PAC-Bayesian domain adaptation bound for the stochastic Gibbs classifier. We specialize it to linear classifiers, and design a learning algorithm which shows interesting results on a synthetic problem and on a popular sentiment annotation task. On the other hand, we generalize these results to multisource domain adaptation, allowing us to take into account different source domains. This study opens the door to tackling domain adaptation tasks by making use of all the PAC-Bayesian tools.

1 Introduction

As human beings, we learn from what we saw before. Think about our education process: when a student attends a new course, he has to make use of the knowledge he acquired during previous courses. However, in machine learning the most common assumption is that the learning and test data are drawn from the same probability distribution. This strong assumption may be clearly irrelevant for many real tasks, including those where we wish to adapt a model from one task to another. For instance, a spam filtering system suitable for one user can be poorly adapted to another who receives significantly different emails. In other words, the learning data associated with one or several users could be unrepresentative of the test data coming from another one. This enhances the need to design methods for adapting a classifier from learning (source) data to test (target) data. One solution to tackle this issue is to consider the domain adaptation framework,¹ which arises when the distribution generating the target data (the target domain) differs from the one generating the source data (the source domain). In such a situation, it is well known that domain adaptation is a hard and challenging task even under strong assumptions (Ben-David and Urner, 2012; Ben-David et al., 2010b; Ben-David and Urner, 2014). Note that domain adaptation with learning data coming from different source domains is referred to as multisource or multiple sources domain adaptation (Crammer et al., 2007; Mansour et al., 2009c; Ben-David et al., 2010a).

¹ See the surveys proposed by Jiang (2008); Quiñonero-Candela et al. (2009); and Margolis (2011).

Among the existing approaches in the literature to address domain adaptation, the instance weighting-based methods allow one to deal with the covariate-shift problem (e.g., Huang et al., 2006; Sugiyama et al., 2008), where source and target domains diverge only in their marginals, i.e., they share the same labeling function. Another technique is to exploit self-labeling procedures, where the objective is to transfer the source labels to the target unlabeled points (e.g., Bruzzone and Marconcini, 2010; Habrard et al., 2013; Morvant, 2014). A third solution is to learn a new common representation from the unlabeled part of source and target data. Then, a standard supervised learning algorithm can be executed on the source labeled instances (e.g., Glorot et al., 2011; Chen et al., 2012).

The work presented in this paper belongs to a popular class of approaches, which relies on a distance between the source distribution and the target distribution. Such a distance depends on the set $H$ of hypotheses (or classifiers) considered by the learning algorithm. The intuition behind this approach is that one must look for a set $H$ that minimizes the distance while preserving good performance on the source data; if the distributions are close under this measure, then generalization ability may be easier to quantify. In fact, defining such a measure to quantify how much the domains are related is a major issue in domain adaptation. For example, in the context of binary classification with the 0-1 loss function, Ben-David et al. (2010a) and Ben-David et al. (2006) have considered the $H\Delta H$-divergence between the marginal distributions. This quantity is based on the maximal disagreement between two classifiers, allowing them to deduce a domain adaptation generalization bound based on the VC-dimension theory. The discrepancy distance proposed by Mansour et al. (2009a) generalizes this divergence to real-valued functions and more general losses, and is used to obtain a generalization bound based on the Rademacher complexity. In this context, Cortes and Mohri (2011, 2014) have specialized the minimization of the discrepancy to regression with kernels. In these situations, domain adaptation can be viewed as a multiple trade-off between the complexity of the hypothesis class $H$, the adaptation ability of $H$ according to the divergence between the marginals, and the empirical source risk. Moreover, other measures have been exploited under different assumptions, such as the Rényi divergence suitable for importance weighting (Mansour et al., 2009b), or the measure proposed by C. Zhang (2012) which takes into account the source and target true labeling, or the Bayesian divergence prior (Li and Bilmes, 2007) which favors classifiers closer to the best source model. However, a majority of methods prefer to perform a two-step approach: (i) first construct a suitable representation by minimizing the divergence, then (ii) learn a model on the source domain in the new representation space.

The novelty of our contribution is to explore the PAC-Bayesian framework to tackle domain adaptation in a binary classification situation without target labels (sometimes called unsupervised domain adaptation). Given a prior distribution over a family of classifiers $H$, PAC-Bayesian theory (introduced by McAllester, 1999) focuses on algorithms that output a posterior distribution $\rho$ over $H$ (i.e., a $\rho$-average over $H$) rather than just a single classifier $h \in H$. Following this principle, we propose a pseudometric which evaluates the domain divergence according to the $\rho$-average disagreement of the classifiers over the domains. This disagreement measure shows many advantages. First, it is ideal for the PAC-Bayesian setting, since it is expressed as a $\rho$-average over $H$. Second, we prove that it is always lower than the popular $H\Delta H$-divergence. Last but not least, our measure can be easily estimated from samples. Indeed, based on this disagreement measure, we derived in a previous work (Germain et al., 2013) a first PAC-Bayesian domain adaptation bound expressed as a $\rho$-averaging. In this paper, we provide a new version of this result, that does not change the philosophy supported by the previous bound, but clearly improves the theoretical result: the domain adaptation bound is now tighter and easier to interpret.² Thanks to this new result, we also derive three new PAC-Bayesian domain adaptation generalization bounds. Then, in contrast to the majority of methods that perform a two-step procedure, we design an algorithm tailored to linear classifiers, called PBDA, which jointly minimizes the multiple trade-offs implied by the bounds. The first two quantities are, as usual in the PAC-Bayesian approach, the complexity of the majority vote measured by a Kullback-Leibler divergence and the empirical risk measured by the $\rho$-average errors on the source sample. The third quantity corresponds to our domain divergence and assesses the capacity of the posterior distribution to distinguish some structural difference between the source and target samples. Finally, we extend our results to domain adaptation with multiple sources by considering a mixture of different source domains as done by Ben-David et al. (2010a).

The rest of the paper is structured as follows. Section 2 deals with two seminal works on domain adaptation. The PAC-Bayesian framework is then recalled in Section 3. Note that for the sake of completeness, we provide for the first time the explicit derivation of the algorithm PBGD3 (Germain et al., 2009a) tailored to linear classifiers in supervised learning. Our main contribution, which consists in a domain adaptation bound suitable for PAC-Bayesian learning, is presented in Section 4. Then, we derive our new algorithm for PAC-Bayesian domain adaptation in Section 5, that we experiment in Section 6. Afterwards, we generalize this analysis to multisource domain adaptation in Section 7. Before concluding in Section 9, we discuss two important points in Section 8: (i) two different results for the multisource setting that imply open questions for deriving new algorithms, and (ii) the comparison between our new result and the one provided in Germain et al. (2013).

² In this paper, we were very keen to improve the readability of our proofs, particularly those provided by Germain et al. (2013) as supplementary material. The proof techniques may be of independent interest.

2 Domain Adaptation Related Works

In this section, we review the two seminal works in domain adaptation that are based on a divergence measure between the domains (Ben-David et al., 2010a; Ben-David et al., 2006; Mansour et al., 2009a).

2.1 Notations and Setting

We consider domain adaptation for binary classification tasks where $X \subseteq \mathbb{R}^d$ is the input space of dimension $d$ and $Y = \{-1, +1\}$ is the label set. The source domain $P_S$ and the target domain $P_T$ are two different distributions over $X \times Y$ (unknown and fixed), $D_S$ and $D_T$ being the respective marginal distributions over $X$. We tackle the challenging task where we have no target labels. A learning algorithm is then provided with a labeled source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^{m}$ consisting of $m$ examples drawn i.i.d.³ from $P_S$, and an unlabeled target sample $T = \{x_j^t\}_{j=1}^{m'}$ consisting of $m'$ examples drawn i.i.d. from $D_T$. Note that we denote the distribution of an $m$-sample by $(P_S)^m$. We suppose that $H$ is a set of hypothesis functions from $X$ to $Y$. The expected source error and the expected target error of $h \in H$ over $P_S$, respectively $P_T$, are the probability that $h$ errs on the entire distribution $P_S$, respectively $P_T$,

\[
R_{P_S}(h) := \mathbb{E}_{(x^s, y^s) \sim P_S} L_{0\text{-}1}\big(h(x^s), y^s\big), \quad \text{and} \quad R_{P_T}(h) := \mathbb{E}_{(x^t, y^t) \sim P_T} L_{0\text{-}1}\big(h(x^t), y^t\big),
\]

where $L_{0\text{-}1}(a, b) := I[a \neq b]$ is the 0-1 loss function, which returns 1 if $a \neq b$ and 0 otherwise. The empirical source error $R_S(\cdot)$ on the learning sample $S$ is

\[
R_S(h) := \frac{1}{m} \sum_{(x^s, y^s) \in S} L_{0\text{-}1}\big(h(x^s), y^s\big).
\]

The main objective in domain adaptation is then to learn, without target labels, a classifier $h \in H$ leading to the lowest expected target error $R_{P_T}(h)$.

We also introduce the expected source disagreement $R_{D_S}(h, h')$ and the expected target disagreement $R_{D_T}(h, h')$ of $(h, h') \in H^2$, which measure the probability that two classifiers $h$ and $h'$ do not agree on the respective marginal distributions, and are defined by

\[
R_{D_S}(h, h') := \mathbb{E}_{x^s \sim D_S} L_{0\text{-}1}\big(h(x^s), h'(x^s)\big) \quad \text{and} \quad R_{D_T}(h, h') := \mathbb{E}_{x^t \sim D_T} L_{0\text{-}1}\big(h(x^t), h'(x^t)\big).
\]

The empirical source disagreement $R_S(h, h')$ on $S$ and the empirical target disagreement $R_T(h, h')$ on $T$ are

\[
R_S(h, h') := \frac{1}{m} \sum_{x^s \in S} L_{0\text{-}1}\big(h(x^s), h'(x^s)\big) \quad \text{and} \quad R_T(h, h') := \frac{1}{m'} \sum_{x^t \in T} L_{0\text{-}1}\big(h(x^t), h'(x^t)\big).
\]

Note that, depending on the context, $S$ denotes either the source labeled sample $\{(x_i^s, y_i^s)\}_{i=1}^{m}$ or its unlabeled part $\{x_i^s\}_{i=1}^{m}$. Note also that the expected error $R_P(h)$ on a distribution $P$ can be viewed as a shortcut notation for the expected disagreement between a hypothesis $h$ and a labeling function $f_P$ that assigns the true label to an example description with respect to $P$. We have

\[
R_P(h) = R_D(h, f_P) := \mathbb{E}_{x \sim D} L_{0\text{-}1}\big(h(x), f_P(x)\big),
\]

where $D$ is the marginal distribution of $P$ over $X$.

³ i.i.d. stands for independent and identically distributed.
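The empirical quantities of this section are simple sample averages. The following minimal sketch, in plain Python, illustrates the empirical error $R_S(h)$ and the empirical disagreements $R_S(h, h')$ and $R_T(h, h')$. The one-dimensional threshold classifiers and the toy samples are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
def zero_one_loss(a, b):
    """0-1 loss: 1 if the two labels differ, 0 otherwise."""
    return 1 if a != b else 0


def empirical_error(h, labeled_sample):
    """R_S(h): average 0-1 loss of h on a labeled sample [(x, y), ...]."""
    return sum(zero_one_loss(h(x), y) for x, y in labeled_sample) / len(labeled_sample)


def empirical_disagreement(h, h_prime, unlabeled_sample):
    """R_S(h, h') or R_T(h, h'): fraction of points on which h and h' differ."""
    return sum(zero_one_loss(h(x), h_prime(x)) for x in unlabeled_sample) / len(unlabeled_sample)


def threshold(t):
    """Hypothetical 1-D classifier: predicts +1 when x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1


# Toy data (assumption): a labeled source sample S and an unlabeled target sample T.
S = [(-2.0, -1), (-1.0, -1), (0.5, 1), (2.0, 1)]
T = [0.2, 0.4, 0.8, 1.5]

h, h_prime = threshold(0.0), threshold(1.0)
source_error = empirical_error(h, S)
source_disagreement = empirical_disagreement(h, h_prime, [x for x, _ in S])
target_disagreement = empirical_disagreement(h, h_prime, T)
print(source_error, source_disagreement, target_disagreement)
```

Note that the disagreements only need the unlabeled parts of the samples, which is why they remain computable on the target domain.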

2.2 Necessity of a Domain Divergence

The domain adaptation objective is to find a low-error target hypothesis, even if the target labels are not available. Even under strong assumptions, this task can be impossible to solve (Ben-David and Urner, 2012; Ben-David et al., 2010b). However, for deriving generalization ability in a domain adaptation situation (with the help of a domain adaptation bound), it is critical to make use of a divergence between the source and the target domains: the more similar the domains, the easier the adaptation appears. Some previous works have proposed different quantities to estimate how close a domain is to another one (C. Zhang, 2012; Ben-David et al., 2010a; Mansour et al., 2009a,b; Ben-David et al., 2006; Li and Bilmes, 2007). Concretely, two domains $P_S$ and $P_T$ differ if their marginals $D_S$ and $D_T$ are different, or if the source labeling function differs from the target one, or if both happen. This suggests taking into account two divergences: one between $D_S$ and $D_T$, and one between the labelings. If we have some target labels, we can combine the two distances as in C. Zhang (2012). Otherwise, we preferably consider two separate measures, since it is impossible to estimate the best target hypothesis in such a situation. Usually, we suppose that the source labeling function is somehow related to the target one, then we look for a representation where the marginals $D_S$ and $D_T$ appear closer without losing performance on the source domain.

2.3 Domain Adaptation Bounds for Binary Classification

We now review the first two seminal works which propose domain adaptation bounds based on a marginal divergence. First, under the assumption that there exists a hypothesis in $H$ that performs well on both the source and the target domain, Ben-David et al. (2010a) and Ben-David et al. (2006) have provided the following domain adaptation bound.

Theorem 1 (Ben-David et al., 2010a; Ben-David et al., 2006). Let $H$ be a symmetric⁴ hypothesis class. We have

\[
\forall h \in H, \quad R_{P_T}(h) \le R_{P_S}(h) + \tfrac{1}{2}\, d_{H\Delta H}(D_S, D_T) + \mu_{h^*}, \tag{1}
\]

where

\[
d_{H\Delta H}(D_S, D_T) := 2 \sup_{(h, h') \in H^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big|
\]

is the $H\Delta H$-distance between the marginals $D_S$ and $D_T$, and $\mu_{h^*} := R_{P_S}(h^*) + R_{P_T}(h^*)$ is the error of the best hypothesis overall, denoted $h^*$, and defined by

\[
h^* := \operatorname*{argmin}_{h \in H} \big( R_{P_S}(h) + R_{P_T}(h) \big).
\]

This bound depends on four terms. $R_{P_S}(h)$ is the classical source domain expected error. $\tfrac{1}{2} d_{H\Delta H}(D_S, D_T)$ depends on $H$ and corresponds to the maximum disagreement between two hypotheses of $H$. In other words, it quantifies how well hypotheses from $H$ can detect differences between these marginals: the lower this measure is for a given $H$, the better are the generalization guarantees. The last term $\mu_{h^*} = R_{P_S}(h^*) + R_{P_T}(h^*)$ is related to the best hypothesis $h^*$ over the domains and acts as a quality measure of $H$ in terms of labeling information. If $h^*$ does not have a good performance on both the source and the target domain, then there is no way one can adapt from this source to this target. Hence, as pointed out by the authors, Equation (1), together with the usual VC-bound theory, expresses a multiple trade-off between the accuracy of some particular hypothesis $h$, the complexity of $H$, and the incapacity of hypotheses of $H$ to detect differences between the source and the target domain.

Second, Mansour et al. (2009a) have extended the $H\Delta H$-distance to the discrepancy divergence for regression and any symmetric loss $L$ fulfilling the triangle inequality. Given $L : [-1, +1]^2 \to \mathbb{R}_+$ such a loss, the discrepancy $\mathrm{disc}_L(D_S, D_T)$ between $D_S$ and $D_T$ is

\[
\mathrm{disc}_L(D_S, D_T) := \sup_{(h, h') \in H^2} \Big| \mathbb{E}_{x^t \sim D_T} L\big(h(x^t), h'(x^t)\big) - \mathbb{E}_{x^s \sim D_S} L\big(h(x^s), h'(x^s)\big) \Big|.
\]

⁴ In a symmetric hypothesis space $H$, for every $h \in H$, its inverse $-h$ is also in $H$.
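To make the worst-case nature of the $H\Delta H$-distance concrete, here is a hedged sketch that evaluates the supremum by brute force, using empirical disagreements in place of the true ones. It assumes a finite hypothesis set, which the bounds above do not require; the threshold classifiers and samples are hypothetical.

```python
from itertools import product


def disagreement(h, h_prime, sample):
    """Empirical disagreement: fraction of points where h and h' differ."""
    return sum(h(x) != h_prime(x) for x in sample) / len(sample)


def half_hdh_divergence(hypotheses, source, target):
    """Empirical counterpart of (1/2) d_HdH: the largest gap, over all pairs
    (h, h'), between target and source disagreement. Brute force over a
    finite hypothesis set (an assumption made for this illustration)."""
    return max(
        abs(disagreement(h, g, target) - disagreement(h, g, source))
        for h, g in product(hypotheses, repeat=2)
    )


def threshold(t):
    """Hypothetical 1-D classifier: predicts +1 when x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1


H = [threshold(t) for t in (-1.0, 0.0, 1.0)]
source_points = [-2.0, -1.5, -0.5, 0.5]
target_points = [0.5, 1.5, 2.0, 3.0]  # shifted to the right of the source

half_d = half_hdh_divergence(H, source_points, target_points)
print(half_d)
```

The cost grows quadratically with the number of hypotheses, and for infinite classes the supremum is no longer a simple enumeration, which is one reason this quantity is hard to optimize jointly with the source error.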

Note that with the 0-1 loss in binary classification, we have

\[
\tfrac{1}{2}\, d_{H\Delta H}(D_S, D_T) = \mathrm{disc}_{L_{0\text{-}1}}(D_S, D_T).
\]

Even if these two divergences may coincide, the following domain adaptation bound of Mansour et al. (2009a) differs from Theorem 1.

Theorem 2 (Mansour et al., 2009a). Let $H$ be a symmetric hypothesis class. We have

\[
\forall h \in H, \quad R_{P_T}(h) - R_{P_T}(h_T^*) \le R_{D_S}(h_S^*, h) + \mathrm{disc}_{L_{0\text{-}1}}(D_S, D_T) + \nu_{(h_S^*, h_T^*)}, \tag{2}
\]

where $\nu_{(h_S^*, h_T^*)} := R_{D_S}(h_S^*, h_T^*)$ is the disagreement between the ideal hypotheses on the target and source domains, defined respectively as

\[
h_T^* := \operatorname*{argmin}_{h \in H} R_{P_T}(h), \quad \text{and} \quad h_S^* := \operatorname*{argmin}_{h \in H} R_{P_S}(h).
\]

In this context, Equation (2) can be tighter⁵ since it bounds the difference between the target error of a classifier and the one of the optimal $h_T^*$. This bound expresses a trade-off between the disagreement between $h$ and the best source hypothesis $h_S^*$, the complexity of $H$ (with the Rademacher complexity), and again the incapacity of hypotheses to detect differences between the domains.

To conclude, the domain adaptation bounds (1) and (2) suggest that if the divergence between the domains is low, a low-error classifier over the source domain might perform well on the target one. These divergences compute the worst case of the disagreement between a pair of hypotheses. We propose in Section 4 an average-case approach by making use of the essence of the PAC-Bayesian theory, which is known to offer tight generalization bounds (McAllester, 1999; Germain et al., 2009a; Parrado-Hernández et al., 2012).

3 PAC-Bayesian Theory in Supervised Learning

Let us now review the classical supervised binary classification framework called the PAC-Bayesian theory, first introduced by McAllester (1999). This theory succeeds in providing tight generalization guarantees on majority vote classifiers, without relying on any validation set. Throughout this section, we adopt an algorithm design perspective: we interpret the various forms of the PAC-Bayesian theorem as a guide to derive new machine learning algorithms. Indeed, the PAC-Bayesian analysis of domain adaptation provided in the forthcoming sections is oriented by the motivation of creating new adaptive algorithms.

3.1 Notations and Setting

Traditionally, the PAC-Bayesian theory considers weighted majority votes over a set $H$ of binary hypotheses. Given a prior distribution $\pi$ over $H$ and a training set $S$, the learner aims at finding the posterior distribution $\rho$ over $H$ leading to a $\rho$-weighted majority vote $B_\rho$ (also called the Bayes classifier) with good generalization guarantees and defined by

\[
B_\rho(x) := \operatorname{sign}\Big[ \mathbb{E}_{h \sim \rho}\, h(x) \Big].
\]

Minimizing $R_{P_S}(B_\rho)$, the risk of $B_\rho$, is known to be NP-hard. In the PAC-Bayesian approach, it is replaced by the risk of the stochastic Gibbs classifier $G_\rho$ associated with $\rho$. In order to predict the label of an example $x$, the Gibbs classifier first draws a hypothesis $h$ from $H$ according to $\rho$, then returns $h(x)$ as label. Note that the error of the Gibbs classifier on a domain $P_S$ corresponds to the expectation of the errors over $\rho$:

\[
R_{P_S}(G_\rho) = \mathbb{E}_{h \sim \rho} R_{P_S}(h). \tag{3}
\]

⁵ Equation (1) can lead to an error term 3 times higher than Equation (2) in some cases (Mansour et al., 2009a).
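The difference between the deterministic majority vote $B_\rho$ and the stochastic Gibbs classifier $G_\rho$ can be illustrated on a small example. The sketch below assumes a finite hypothesis set with a discrete posterior given as a list of weights, and computes empirical risks rather than the expected risks of the text; the tie-breaking rule of the vote and all data are assumptions made for the illustration.

```python
def gibbs_risk(hypotheses, rho, labeled_sample):
    """Empirical Gibbs risk: rho-weighted average of the voters' 0-1 errors."""
    m = len(labeled_sample)
    return sum(
        weight * sum(h(x) != y for x, y in labeled_sample) / m
        for h, weight in zip(hypotheses, rho)
    )


def majority_vote(hypotheses, rho, x):
    """B_rho(x): sign of the rho-weighted average prediction.
    Ties are broken toward +1 (an arbitrary choice for this sketch)."""
    score = sum(weight * h(x) for h, weight in zip(hypotheses, rho))
    return 1 if score >= 0 else -1


def majority_vote_risk(hypotheses, rho, labeled_sample):
    """Empirical 0-1 risk of the rho-weighted majority vote."""
    errors = sum(majority_vote(hypotheses, rho, x) != y for x, y in labeled_sample)
    return errors / len(labeled_sample)


def threshold(t):
    """Hypothetical 1-D classifier: predicts +1 when x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1


H = [threshold(t) for t in (-1.0, 0.0, 1.0)]
rho = [0.25, 0.5, 0.25]  # discrete posterior over the three voters
S = [(-2.0, -1), (-0.5, -1), (0.5, 1), (2.0, 1)]

gibbs = gibbs_risk(H, rho, S)
vote = majority_vote_risk(H, rho, S)
print(gibbs, vote)
```

In this toy case the vote makes no mistake although two of its voters do, so the Gibbs risk is strictly larger than the majority vote risk.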

In this setting, if $B_\rho$ misclassifies $x$, then at least half of the classifiers (under $\rho$) err on $x$. Hence, we have $R_{P_S}(B_\rho) \le 2\, R_{P_S}(G_\rho)$. Another result on the relation between $R_{P_S}(B_\rho)$ and $R_{P_S}(G_\rho)$ is the C-bound of Lacasse et al. (2006), expressed as

\[
R_{P_S}(B_\rho) \le 1 - \frac{\big(1 - 2 R_{P_S}(G_\rho)\big)^2}{1 - 2 R_{D_S}(G_\rho, G_\rho)}, \tag{4}
\]

where $R_{D_S}(G_\rho, G_\rho)$ corresponds to the disagreement of the classifiers over $\rho$:

\[
R_{D_S}(G_\rho, G_\rho) := \mathbb{E}_{(h, h') \sim \rho^2} R_{D_S}(h, h'). \tag{5}
\]

Equation (4) suggests that for a fixed numerator, i.e., a fixed risk of the Gibbs classifier, the best majority vote is the one with the lowest denominator, i.e., with the greatest disagreement between its voters (see Laviolette et al., 2011, for further analysis). Finally, we introduce the notion of expected joint error of a pair of classifiers $(h, h')$ drawn according to the distribution $\rho$, defined as

\[
e_{P_S}(G_\rho, G_\rho) := \mathbb{E}_{(h, h') \sim \rho^2}\, \mathbb{E}_{(x, y) \sim P_S} L_{0\text{-}1}\big(h(x), y\big) \times L_{0\text{-}1}\big(h'(x), y\big). \tag{6}
\]

The PAC-Bayesian theory allows one to bound the expected error $R_{P_S}(G_\rho)$ in terms of two major quantities: the empirical error $R_S(G_\rho) = \mathbb{E}_{h \sim \rho} R_S(h)$ estimated on a sample $S$ drawn i.i.d. from $P_S$, and the Kullback-Leibler divergence $\mathrm{KL}(\rho \| \pi) := \mathbb{E}_{h \sim \rho} \ln \frac{\rho(h)}{\pi(h)}$ (let us recall that $\pi$ and $\rho$ are respectively the prior and the posterior distributions). The three main PAC-Bayes theorems, that we present in the next section, have been proposed by McAllester (1999); Seeger (2002); Langford (2005); and Catoni (2007).

3.2 Three Versions of the PAC-Bayesian Theorem

First, let us consider the KL-divergence $\mathrm{kl}(a \| b)$ between two Bernoulli distributions with success probability $a$ and $b$, defined by

\[
\mathrm{kl}(a \| b) := a \ln \frac{a}{b} + (1 - a) \ln \frac{1 - a}{1 - b}.
\]

Seeger (2002) and Langford (2005) have derived the following PAC-Bayesian theorem, in which the trade-off between the complexity and the risk is handled by $\mathrm{kl}(\cdot \| \cdot)$.

Theorem 3 (Seeger, 2002; Langford, 2005). For any domain $P_S$ over $X \times Y$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ over $H$, we have

\[
\mathrm{kl}\big( R_S(G_\rho) \,\big\|\, R_{P_S}(G_\rho) \big) \le \frac{1}{m} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right].
\]

This version of the PAC-Bayes theorem offers a tight bound, especially for low empirical risk. However, due to the $\mathrm{kl}\big(R_S(G_\rho) \| R_{P_S}(G_\rho)\big)$ term, this bound remains difficult to interpret: the link between the empirical risk $R_S(G_\rho)$ and the true risk $R_{P_S}(G_\rho)$ is not given in closed form. Thus, from an algorithmic point of view, finding the distribution $\rho$ that minimizes the bound on $R_{P_S}(G_\rho)$ given by Theorem 3 might be a difficult task.

The following version of the PAC-Bayes theorem, which was the first proposed (McAllester, 1999), appears easier to interpret since it links the terms $R_S(G_\rho)$ and $R_{P_S}(G_\rho)$ by a linear relation. Note that Theorem 4 can be straightforwardly obtained from Theorem 3 using Pinsker's inequality:

\[
2 (q - p)^2 \le \mathrm{kl}(q \| p). \tag{7}
\]

Theorem 4 (McAllester, 1999). For any domain $P_S$ over $X \times Y$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ over $H$, we have

\[
R_{P_S}(G_\rho) \le R_S(G_\rho) + \sqrt{ \frac{1}{2m} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right] }.
\]
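Although the bound of Theorem 3 is implicit, it can be evaluated numerically, because $\mathrm{kl}(q \| p)$ is increasing in $p$ for $p \ge q$. The sketch below inverts it by bisection and compares the result with the explicit bound of Theorem 4. The bisection routine, the clipping constant and the numerical values (empirical risk, KL term, sample size, confidence) are assumptions chosen for illustration.

```python
import math


def kl_bernoulli(q, p):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), with 0 ln 0 = 0."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if 1.0 > q:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def complexity_term(kl_rho_pi, m, delta):
    """Right-hand side of Theorem 3: (KL(rho||pi) + ln(2 sqrt(m) / delta)) / m."""
    return (kl_rho_pi + math.log(2.0 * math.sqrt(m) / delta)) / m


def seeger_bound(emp_risk, kl_rho_pi, m, delta, iterations=100):
    """Largest p >= emp_risk such that kl(emp_risk || p) stays below the
    complexity term, found by bisection on [emp_risk, 1]."""
    rhs = complexity_term(kl_rho_pi, m, delta)
    low, high = emp_risk, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if rhs >= kl_bernoulli(emp_risk, mid):
            low = mid
        else:
            high = mid
    return high


def mcallester_bound(emp_risk, kl_rho_pi, m, delta):
    """Explicit bound of Theorem 4 (obtained through Pinsker's inequality)."""
    return emp_risk + math.sqrt(complexity_term(kl_rho_pi, m, delta) / 2.0)


# Hypothetical values, for illustration only.
emp_risk, kl_rho_pi, m, delta = 0.05, 5.0, 1000, 0.05
seeger = seeger_bound(emp_risk, kl_rho_pi, m, delta)
mcallester = mcallester_bound(emp_risk, kl_rho_pi, m, delta)
print(seeger, mcallester)
```

Because Pinsker's inequality is a relaxation, the inverted bound of Theorem 3 is never larger than the bound of Theorem 4, and the gap tends to be most visible when the empirical risk is small.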

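Theorems 3 and 4 suggest that, in order to minimize the expected risk, a learning algorithm should perform a trade-off between the empirical risk minimization $R_S(G_\rho)$ and the KL-divergence minimization $\mathrm{KL}(\rho \| \pi)$ (roughly speaking, the complexity term). The nature of this trade-off can be explicitly controlled in Theorem 5 below. This PAC-Bayesian result, first proposed by Catoni (2007), is defined with a hyperparameter (here named $c$). It appears to be a natural tool to design PAC-Bayesian algorithms. We present this result in the simplified form suggested by Germain et al. (2009b).

Theorem 5 (Catoni, 2007). For any domain $P_S$ over $X \times Y$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, any $\delta \in (0, 1]$, and any real number $c > 0$, with a probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ on $H$, we have

\[
R_{P_S}(G_\rho) \le \frac{c}{1 - e^{-c}} \left[ R_S(G_\rho) + \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{1}{\delta}}{m \times c} \right].
\]

The bound given by Theorem 5 has two interesting characteristics. First, choosing $c = \frac{1}{\sqrt{m}}$, the bound becomes consistent: it converges to $1 \times [R_S(G_\rho) + 0]$ as $m$ grows. Second, as described in Section 3.3, its minimization is closely related to the minimization problem associated with the SVM when $\rho$ is an isotropic Gaussian over the space of linear classifiers (Germain et al., 2009a). Hence, the value $c$ allows us to control the trade-off between the empirical risk $R_S(G_\rho)$ and the complexity term $\mathrm{KL}(\rho \| \pi)$.

3.3 Supervised PAC-Bayesian Learning of Linear Classifiers

Let us consider $H$ as a set of linear classifiers in a $d$-dimensional space. Each $h_{w'} \in H$ is defined by a weight vector $w' \in \mathbb{R}^d$:

\[
h_{w'}(x) := \operatorname{sgn}(w' \cdot x),
\]

where $\cdot$ denotes the dot product. By restricting the prior and the posterior distributions over $H$ to be Gaussian distributions, Langford and Shawe-Taylor (2002); Ambroladze et al. (2006); and Parrado-Hernández et al. (2012) have specialized the PAC-Bayesian theory in order to bound the expected risk of any linear classifier $h_w \in H$. More precisely, given a prior $\pi_0$ and a posterior $\rho_w$ defined as spherical Gaussians with identity covariance matrix respectively centered on vectors $0$ and $w$, for any $h_{w'} \in H$, we have

\[
\pi_0(h_{w'}) := \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \| w' \|^2 \right) \quad \text{and} \quad \rho_w(h_{w'}) := \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \| w' - w \|^2 \right).
\]

An interesting property of these Gaussian distributions is that the prediction of the $\rho_w$-weighted majority vote $B_{\rho_w}$ coincides with the one of the linear classifier $h_w$. Indeed, we have

\[
\forall x \in X,\ \forall h_w \in H, \quad h_w(x) = B_{\rho_w}(x) = \operatorname{sign}\Big[ \mathbb{E}_{h_{w'} \sim \rho_w} h_{w'}(x) \Big].
\]

Moreover, the expected risk of the Gibbs classifier $G_{\rho_w}$ on a domain $P_S$ is then given by

\[
\begin{aligned}
R_{P_S}(G_{\rho_w}) &= \mathbb{E}_{(x, y) \sim P_S}\, \mathbb{E}_{h_{w'} \sim \rho_w} L_{0\text{-}1}\big(h_{w'}(x), y\big) \\
&= \mathbb{E}_{(x, y) \sim P_S}\, \mathbb{E}_{h_{w'} \sim \rho_w} I\big(h_{w'}(x) \neq y\big) \\
&= \mathbb{E}_{(x, y) \sim P_S}\, \mathbb{E}_{h_{w'} \sim \rho_w} I\big(y\, w' \cdot x \le 0\big) \\
&= \mathbb{E}_{(x, y) \sim P_S} \int_{\mathbb{R}^d} \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \| w' - w \|^2 \right) I\big(y\, w' \cdot x \le 0\big)\, dw' \\
&= \mathbb{E}_{(x, y) \sim P_S} \Pr_{t \sim N(0, 1)} \left( t \le -y\, \frac{w \cdot x}{\| x \|} \right) \\
&= \mathbb{E}_{(x, y) \sim P_S} \Phi\left( y\, \frac{w \cdot x}{\| x \|} \right),
\end{aligned}
\]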

where we defined

\[
\Phi(a) := \tfrac{1}{2} \left[ 1 - \operatorname{Erf}\left( \frac{a}{\sqrt{2}} \right) \right], \tag{8}
\]

with $\operatorname{Erf}$ the Gauss error function, defined as

\[
\operatorname{Erf}(b) := \frac{2}{\sqrt{\pi}} \int_0^b \exp(-t^2)\, dt.
\]

Finally, the KL-divergence between $\rho_w$ and $\pi_0$ becomes simply

\[
\mathrm{KL}(\rho_w \| \pi_0) = \tfrac{1}{2} \| w \|^2.
\]

3.3.1 Objective Function and Gradient

Based on the specialization of the PAC-Bayesian theory to linear classifiers, Germain et al. (2009a) suggested minimizing a PAC-Bayesian bound on $R_{P_S}(G_{\rho_w})$. For the sake of completeness, we provide here more mathematical details than in the original conference paper (Germain et al., 2009a). We will build on this PAC-Bayesian learning algorithm for supervised learning in our domain adaptation work.

Given a sample $S = \{(x_i, y_i)\}_{i=1}^{m}$ and a hyperparameter $C > 0$, the learning algorithm performs a gradient descent in order to find an optimal weight vector $w$ that minimizes

\[
F(w) = C\, m\, R_S(G_{\rho_w}) + \mathrm{KL}(\rho_w \| \pi_0) = C \sum_{i=1}^{m} \Phi\left( y_i\, \frac{w \cdot x_i}{\| x_i \|} \right) + \tfrac{1}{2} \| w \|^2. \tag{9}
\]

It turns out that the optimal vector $w$ corresponds to the distribution $\rho_w$ that minimizes the value of the bound on $R_{P_S}(G_{\rho_w})$ given by Theorem 5, with the parameter $c$ of the theorem being the hyperparameter $C$ of the learning algorithm. It is important to point out that PAC-Bayesian theorems bound simultaneously $R_{P_S}(G_{\rho_w})$ for every $\rho_w$ on $H$. Therefore, one can freely explore the domain of the objective function $F$ to choose a posterior distribution $\rho_w$ that gives, thanks to Theorem 5, a bound valid with probability $1 - \delta$.

The minimization of Equation (9) by gradient descent corresponds to the learning algorithm called PBGD3 of Germain et al. (2009a). The gradient of $F(w)$ is given by the vector $\nabla F(w)$:

\[
\nabla F(w) = C \sum_{i=1}^{m} \Phi'\left( y_i\, \frac{w \cdot x_i}{\| x_i \|} \right) \frac{y_i\, x_i}{\| x_i \|} + w,
\]

where $\Phi'(a) = -\frac{1}{\sqrt{2\pi}} \exp\left( -\tfrac{1}{2} a^2 \right)$ is the derivative of $\Phi$ at point $a$.

Similarly to the SVM, the learning algorithm PBGD3 realizes a trade-off between the empirical risk (expressed by the loss $\Phi$) and the complexity of the learned linear classifier (expressed by the regularizer $\| w \|^2$). This similarity increases when we use a kernel function, as described next.

3.3.2 Using a kernel function

The kernel trick allows one to substitute inner products by a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ in Equation (9). If $k$ is a Mercer kernel, it implicitly represents a function $\phi : X \to \mathbb{R}^{d'}$ that maps an example of $X$ into an arbitrary $d'$-dimensional space,⁶ such that

\[
\forall (x, x') \in X^2, \quad k(x, x') = \phi(x) \cdot \phi(x').
\]

Then, a dual weight vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m) \in \mathbb{R}^m$ encodes the linear classifier $w \in \mathbb{R}^{d'}$ as a linear combination of examples of $S$:

\[
w = \sum_{i=1}^{m} \alpha_i\, \phi(x_i), \quad \text{and thus} \quad h_w(x) = \operatorname{sgn}\left[ \sum_{i=1}^{m} \alpha_i\, k(x_i, x) \right].
\]

⁶ We consider here that the induced space is finite-dimensional.
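The primal objective of Equation (9) and its gradient, before any kernel is introduced, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' PBGD3 implementation: it uses a fixed step size, a single start at $w = 0$ and no random restarts, and the toy sample, the value of `C` and the helper names are hypothetical.

```python
import math


def phi(a):
    """Phi(a) = (1/2) [1 - Erf(a / sqrt(2))], the Gibbs risk of one example."""
    return 0.5 * (1.0 - math.erf(a / math.sqrt(2.0)))


def phi_prime(a):
    """Derivative of Phi: -(1 / sqrt(2 pi)) exp(-a^2 / 2)."""
    return -math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def objective(w, sample, C):
    """F(w) = C * sum_i Phi(y_i w.x_i / ||x_i||) + (1/2) ||w||^2."""
    loss = sum(phi(y * dot(w, x) / math.sqrt(dot(x, x))) for x, y in sample)
    return C * loss + 0.5 * dot(w, w)


def gradient(w, sample, C):
    """Gradient of F at w, following the formula given after Equation (9)."""
    grad = list(w)  # gradient of the regularizer (1/2) ||w||^2
    for x, y in sample:
        x_norm = math.sqrt(dot(x, x))
        coeff = C * phi_prime(y * dot(w, x) / x_norm) * y / x_norm
        for k in range(len(w)):
            grad[k] += coeff * x[k]
    return grad


def gradient_descent(sample, C, dim, step=0.05, iterations=500):
    """Plain fixed-step gradient descent from w = 0 (assumed settings)."""
    w = [0.0] * dim
    for _ in range(iterations):
        g = gradient(w, sample, C)
        w = [wk - step * gk for wk, gk in zip(w, g)]
    return w


# Hypothetical toy sample of (x, y) pairs with labels in {-1, +1}.
sample = [([1.0, 2.0], 1), ([2.0, 0.5], 1), ([-1.0, -1.5], -1), ([-2.0, 0.3], -1)]
C = 10.0
w_hat = gradient_descent(sample, C, dim=2)
print(objective([0.0, 0.0], sample, C), objective(w_hat, sample, C))
```

Since the objective is non-convex in general, a single descent run like this one may stop in a local minimum; the result should be read as a local solution only.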

By the representer theorem (Schölkopf et al., 2001), the vector $w$ minimizing Equation (9) can be recovered by finding the vector $\alpha$ that minimizes

\[
F(\alpha) = C \sum_{i=1}^{m} \Phi\left( y_i\, \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K_{i,j}, \tag{10}
\]

where $K$ is the kernel matrix of size $m \times m$. That is, $K_{i,j} := k(x_i, x_j)$. The gradient of $F(\alpha)$ is simply given by the vector $\nabla F(\alpha) = (\alpha'_1, \alpha'_2, \ldots, \alpha'_m)$, with

\[
\alpha'_{\#} = C \sum_{i=1}^{m} \Phi'\left( y_i\, \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{y_i\, K_{i,\#}}{\sqrt{K_{i,i}}} + \sum_{i=1}^{m} \alpha_i K_{i,\#}, \quad \text{for } \# \in \{1, 2, \ldots, m\}.
\]

3.3.3 Improving the Algorithm Using a Convex Objective

An annoying drawback of PBGD3 is that the objective function is non-convex and the gradient descent implementation needs many random restarts. In fact, we made extensive empirical experiments after the ones described by Germain et al. (2009a) and saw that PBGD3 achieves an equivalent accuracy (and at a fraction of the running time) by replacing the loss function $\Phi$ of Equations (9) and (10) by its convex relaxation, which is

\[
\Phi_{\mathrm{cvx}}(a) := \max\left\{ \Phi(a),\ \tfrac{1}{2} - \frac{a}{\sqrt{2\pi}} \right\} =
\begin{cases}
\tfrac{1}{2} - \frac{a}{\sqrt{2\pi}} & \text{if } a \le 0, \\
\Phi(a) & \text{otherwise.}
\end{cases}
\]

The derivative of $\Phi_{\mathrm{cvx}}$ at point $a$ is then $\Phi'_{\mathrm{cvx}}(a) = -\frac{1}{\sqrt{2\pi}}$ if $a < 0$, and $\Phi'(a)$ otherwise. Note that Figure 1 in Section 5 illustrates the functions $\Phi$ and $\Phi_{\mathrm{cvx}}$.

In the following, we present our contributions on PAC-Bayesian domain adaptation.

The originality of our contribution is to theoretically design a domain adaptation framework for the PAC-Bayesian approach. In Section 4.1, we propose a domain comparison pseudometric suitable in this context. We then derive PAC-Bayesian domain adaptation bounds in Section 4.2, that improve the result proposed in Germain et al. (2013). Finally, note that in Section 5 we see that using the previous approach in a domain adaptation way is a relevant strategy: we specialize our result to linear classifiers.

4.1 A Domain Divergence for PAC-Bayesian Analysis

In the following, while the domain adaptation bounds presented in Section 2 focus on a single classifier, we first define a $\rho$-average disagreement measure to compare the marginals. Then, this leads us to derive our domain adaptation bound suitable for the PAC-Bayesian approach. As discussed in Section 2, the derivation of generalization ability in domain adaptation critically needs a divergence measure between the source and target marginals.

4.1.1 Designing the Divergence

We define a domain disagreement pseudometric⁷ to measure the structural difference between domain marginals in terms of the posterior distribution $\rho$ over $H$. Since we are interested in learning a $\rho$-weighted majority vote $B_\rho$ leading to good generalization guarantees, we propose to follow the idea behind the C-bound presented in Equation (4): given $P_S$, $P_T$, and $\rho$, if $R_{P_S}(G_\rho)$ and $R_{P_T}(G_\rho)$ are similar, then $R_{P_S}(B_\rho)$ and $R_{P_T}(B_\rho)$ are similar when $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_S}(h, h')$ and $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_T}(h, h')$ are also similar. Thus, the domains $P_S$ and $P_T$ are close according to $\rho$ if the divergence between $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_S}(h, h')$ and $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_T}(h, h')$ tends to be low. Our pseudometric is defined as follows.

⁷ A pseudometric $d$ is a metric for which the property $d(x, y) = 0 \iff x = y$ is relaxed to $d(x, y) = 0 \Longleftarrow x = y$.

Definition 1. Let $H$ be a hypothesis class. For any marginal distributions $D_S$ and $D_T$ over $X$, any distribution $\rho$ on $H$, the domain disagreement $\mathrm{dis}_\rho(D_S, D_T)$ between $D_S$ and $D_T$ is defined by

\[
\mathrm{dis}_\rho(D_S, D_T) := \Big| \mathbb{E}_{(h, h') \sim \rho^2} \big[ R_{D_T}(h, h') - R_{D_S}(h, h') \big] \Big| = \big| R_{D_T}(G_\rho, G_\rho) - R_{D_S}(G_\rho, G_\rho) \big|.
\]

Note that $\mathrm{dis}_\rho(\cdot, \cdot)$ is symmetric and fulfills the triangle inequality.

4.1.2 Comparison of the $H\Delta H$-divergence and our domain disagreement

While the $H\Delta H$-divergence of Theorem 1 is difficult to jointly optimize with the empirical source error, our empirical disagreement measure is easier to manipulate: we simply need to compute the $\rho$-average of the classifiers' disagreement instead of finding the pair of classifiers that maximizes the disagreement. Indeed, $\mathrm{dis}_\rho(\cdot, \cdot)$ depends on the majority vote, which suggests that we can directly minimize it via the empirical $\mathrm{dis}_\rho(S, T)$ and the KL-divergence. This can be done without instance reweighting, space representation changing or family of classifiers modification. On the contrary, $d_{H\Delta H}(\cdot, \cdot)$ is a supremum over all $h \in H$ and hence does not depend on the $h$ on which the risk is considered. Moreover, $\mathrm{dis}_\rho(\cdot, \cdot)$ (the $\rho$-average) is lower than $\tfrac{1}{2} d_{H\Delta H}(\cdot, \cdot)$ (the worst case). Indeed, for every $H$ and $\rho$ over $H$, we have

\[
\tfrac{1}{2}\, d_{H\Delta H}(D_S, D_T) = \sup_{(h, h') \in H^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big| \ \ge\ \mathbb{E}_{(h, h') \sim \rho^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big| \ \ge\ \mathrm{dis}_\rho(D_S, D_T).
\]

4.1.3 PAC-Bayesian bounds for our domain disagreement

The following theorems show that $\mathrm{dis}_\rho(D_S, D_T)$ can be bounded in terms of the classical PAC-Bayesian quantities: the empirical disagreement $\mathrm{dis}_\rho(S, T)$ estimated on the source and target samples, and the KL-divergence between the prior and posterior distributions on $H$. For the sake of simplicity, let us first suppose that $m = m'$, i.e., the sizes of $S$ and $T$ are equal. Here is a Seeger-type PAC-Bayesian bound for our domain disagreement $\mathrm{dis}_\rho$.

Theorem 6. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$, for every $\rho$ on $H$, we have

\[
\mathrm{kl}\left( \frac{\mathrm{dis}_\rho(S, T) + 1}{2} \,\middle\|\, \frac{\mathrm{dis}_\rho(D_S, D_T) + 1}{2} \right) \le \frac{1}{m} \left[ 2\, \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right].
\]

Proof. Deferred to Appendix B.
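The empirical disagreement $\mathrm{dis}_\rho(S, T)$ that appears in these bounds is a plain $\rho$-weighted average, in contrast with the supremum behind the $H\Delta H$-divergence. The following sketch computes both on a toy example; it assumes a finite hypothesis set with a discrete posterior, and the threshold classifiers and samples are hypothetical.

```python
from itertools import product


def disagreement(h, h_prime, sample):
    """Empirical disagreement: fraction of points where h and h' differ."""
    return sum(h(x) != h_prime(x) for x in sample) / len(sample)


def domain_disagreement(hypotheses, rho, source, target):
    """Empirical dis_rho(S, T): absolute value of the rho^2-average of
    R_T(h, h') - R_S(h, h') over all ordered pairs of voters."""
    weighted = list(zip(hypotheses, rho))
    total = 0.0
    for (h, weight_h), (g, weight_g) in product(weighted, repeat=2):
        gap = disagreement(h, g, target) - disagreement(h, g, source)
        total += weight_h * weight_g * gap
    return abs(total)


def worst_case_disagreement(hypotheses, source, target):
    """Empirical (1/2) d_HdH: the largest absolute gap over all pairs."""
    return max(
        abs(disagreement(h, g, target) - disagreement(h, g, source))
        for h, g in product(hypotheses, repeat=2)
    )


def threshold(t):
    """Hypothetical 1-D classifier: predicts +1 when x >= t, and -1 otherwise."""
    return lambda x: 1 if x >= t else -1


H = [threshold(t) for t in (-1.0, 0.0, 1.0)]
rho = [0.25, 0.5, 0.25]
source_points = [-2.0, -1.5, -0.5, 0.5]
target_points = [0.5, 1.5, 2.0, 3.0]

dis_rho = domain_disagreement(H, rho, source_points, target_points)
worst = worst_case_disagreement(H, source_points, target_points)
print(dis_rho, worst)
```

Because the averaged quantity is an explicit function of the posterior weights, it can be optimized together with the other terms of a bound, which is the property exploited in the rest of the paper.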
Here is a McAllester-type PAC-Bayesian bound for our domain disagreement $\mathrm{dis}_\rho$, obtained straightforwardly from Theorem 6.

Corollary 1. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$, for every $\rho$ on $H$, we have

\[
\mathrm{dis}_\rho(D_S, D_T) \le \mathrm{dis}_\rho(S, T) + 2 \sqrt{ \frac{1}{2m} \left[ 2\, \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right] }.
\]

Proof. The result is obtained by using Pinsker's inequality (Equation 7) on Theorem 6.

Here is a Catoni-type PAC-Bayesian bound, which helps us to derive a domain adaptation algorithm in the following.

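Theorem 7. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, any $\delta \in (0, 1]$, and any real number $\alpha > 0$, with a probability at least $1 - \delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$, for every $\rho$ on $H$, we have

\[
\mathrm{dis}_\rho(D_S, D_T) \le \frac{2\alpha}{1 - e^{-2\alpha}} \left[ \mathrm{dis}_\rho(S, T) + \frac{2\, \mathrm{KL}(\rho \| \pi) + \ln \frac{2}{\delta}}{m \times \alpha} + 1 \right] - 1.
\]

Proof. Deferred to Appendix C.

Similarly to the empirical risk bound of Catoni (2007) shown by Theorem 5, the above domain disagreement bound is consistent if one puts $\alpha = \frac{1}{2\sqrt{m}}$. Indeed, it converges to $1 \times [\mathrm{dis}_\rho(S, T) + 0 + 1] - 1$ as $m$ grows.

The last result of this section tackles the situation where $m \neq m'$, i.e., the sizes of $S$ and $T$ are different.

Theorem 8. For any marginal distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \sim (D_S)^m$ and $T \sim (D_T)^{m'}$, for every $\rho$ over $H$, we have

\[
\mathrm{dis}_\rho(D_S, D_T) \le \mathrm{dis}_\rho(S, T) + \sqrt{ \frac{1}{2m} \left[ 2\, \mathrm{KL}(\rho \| \pi) + \ln \frac{4\sqrt{m}}{\delta} \right] } + \sqrt{ \frac{1}{2m'} \left[ 2\, \mathrm{KL}(\rho \| \pi) + \ln \frac{4\sqrt{m'}}{\delta} \right] }.
\]

Proof. Deferred to Appendix D.

Note that Theorem 8 is very similar to the result of Corollary 1. In fact, in the particular case $m = m'$, Theorem 8 differs from Corollary 1 only by the $\frac{4\sqrt{m}}{\delta}$ term inside the logarithm, instead of $\frac{2\sqrt{m}}{\delta}$.

We now derive our main result in the following theorem: a domain adaptation bound relevant in a PAC-Bayesian setting.

4.2.1 A domain adaptation bound for the stochastic Gibbs classifier

Theorem 9 below relies on the domain disagreement of Definition 1, and also on the expected joint error of Equation (6).

Theorem 9. Let $H$ be a hypothesis class. We have

\[
\forall \rho \text{ on } H, \quad R_{P_T}(G_\rho) \le R_{P_S}(G_\rho) + \tfrac{1}{2}\, \mathrm{dis}_\rho(D_S, D_T) + \lambda_\rho,
\]

where $\lambda_\rho$ is the deviation between the expected joint errors of $G_\rho$ on the target and source domains:

\[
\begin{aligned}
\lambda_\rho &:= \Big| \mathbb{E}_{(h, h') \sim \rho^2} \Big[ \mathbb{E}_{(x, y) \sim P_T} L_{0\text{-}1}\big(h(x), y\big)\, L_{0\text{-}1}\big(h'(x), y\big) - \mathbb{E}_{(x, y) \sim P_S} L_{0\text{-}1}\big(h(x), y\big)\, L_{0\text{-}1}\big(h'(x), y\big) \Big] \Big| \\
&= \big| e_{P_T}(G_\rho, G_\rho) - e_{P_S}(G_\rho, G_\rho) \big|.
\end{aligned}
\]

Proof. First, notice that for any distribution $P$ on $X \times Y$ and corresponding marginal distribution $D$ on $X$, we have

\[
R_P(G_\rho) = \tfrac{1}{2}\, R_D(G_\rho, G_\rho) + e_P(G_\rho, G_\rho),
\]

as

\[
\begin{aligned}
2\, R_P(G_\rho) &= \mathbb{E}_{(h, h') \sim \rho^2}\, \mathbb{E}_{(x, y) \sim P} \Big[ L_{0\text{-}1}\big(h(x), y\big) + L_{0\text{-}1}\big(h'(x), y\big) \Big] \\
&= \mathbb{E}_{(h, h') \sim \rho^2}\, \mathbb{E}_{(x, y) \sim P} \Big[ 1 \times L_{0\text{-}1}\big(h(x), h'(x)\big) + 2 \times L_{0\text{-}1}\big(h(x), y\big)\, L_{0\text{-}1}\big(h'(x), y\big) \Big] \\
&= R_D(G_\rho, G_\rho) + 2\, e_P(G_\rho, G_\rho).
\end{aligned}
\]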

Therefore,

$$\begin{aligned}
R_{P_T}(G_\rho) - R_{P_S}(G_\rho) &= \tfrac{1}{2}\left( R_{D_T}(G_\rho, G_\rho) - R_{D_S}(G_\rho, G_\rho) \right) + \left( e_{P_T}(G_\rho, G_\rho) - e_{P_S}(G_\rho, G_\rho) \right) \\
&\le \tfrac{1}{2}\left| R_{D_T}(G_\rho, G_\rho) - R_{D_S}(G_\rho, G_\rho) \right| + \left| e_{P_T}(G_\rho, G_\rho) - e_{P_S}(G_\rho, G_\rho) \right| \\
&= \tfrac{1}{2}\,\mathrm{dis}_\rho(D_S, D_T) + \lambda_\rho.
\end{aligned}$$

Our bound is, in general, incomparable with the ones of Theorems 1 and 2. It can be seen as a trade-off between different quantities. The terms $R_{P_S}(G_\rho)$ and $\mathrm{dis}_\rho(D_S, D_T)$ are similar to the first two terms of the domain adaptation bound of Ben-David et al. (2010a): $R_{P_S}(G_\rho)$ is the $\rho$-average risk over $H$ on the source domain, and $\mathrm{dis}_\rho(D_S, D_T)$ measures the $\rho$-average disagreement between the marginals, but is specific to the current $\rho$. The other term $\lambda_\rho$ measures the deviation between the expected joint target and source errors of $G_\rho$. According to this theory, a good domain adaptation is possible if this deviation is low. However, since we suppose that we do not have any label in the target sample, we cannot control or estimate it. In practice, we suppose that $\lambda_\rho$ is low and we neglect it. In other words, we assume that the labeling information between the two domains is related, and that considering only the marginal agreement and the source labels is sufficient to find a good majority vote. Another important point is that this bound does not degenerate when the source and target distributions are the same or close; see Section 8 for a discussion on this point. In the next section, we provide three PAC-Bayesian theorems that justify the empirical optimization of the bound of Theorem 9.

4 PAC-Bayesian theorems for domain adaptation

Finally, our Theorem 9 leads to a PAC-Bayesian bound based on both the empirical source error of the Gibbs classifier and the empirical domain disagreement pseudometric, estimated on source and target samples. From the preceding Seeger-type results, one can then obtain the following PAC-Bayesian domain adaptation bound.

Theorem 10. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, and any $\delta \in (0, 1]$, with a probability at least $1-\delta$ over the choice of $S \times T \sim (P_S \times D_T)^m$, for every $\rho$ on $H$, we have

$$R_{P_T}(G_\rho) \;\le\; \sup \mathcal{R}_\rho + \tfrac{1}{2} \sup \mathcal{D}_\rho + \lambda_\rho,$$

where $\lambda_\rho$ is defined in Theorem 9, and

$$\mathcal{R}_\rho = \left\{ r : \mathrm{kl}\!\left( R_S(G_\rho) \,\middle\|\, r \right) \le \frac{1}{m}\left[ \mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right] \right\},$$

$$\mathcal{D}_\rho = \left\{ d : \mathrm{kl}\!\left( \frac{\mathrm{dis}_\rho(S,T) + 1}{2} \,\middle\|\, \frac{d + 1}{2} \right) \le \frac{1}{m}\left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right] \right\}.$$

Proof. The result is obtained by inserting Theorems 3 and 6 (with $\delta := \delta/2$) in Theorem 9.

The following bound is based on Catoni's approach and corresponds to the one from which we derive, in Section 5, our algorithm for PAC-Bayesian domain adaptation.

Theorem 11. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, any $\delta \in (0, 1]$, any real numbers $\alpha > 0$ and $c > 0$, with a probability at least $1-\delta$ over the choice of $S \times T \sim (P_S \times D_T)^m$, for every posterior distribution $\rho$ on $H$, we have

$$R_{P_T}(G_\rho) \;\le\; c'\, R_S(G_\rho) + \alpha'\, \tfrac{1}{2}\, \mathrm{dis}_\rho(S, T) + \left( \frac{c'}{c} + \frac{\alpha'}{\alpha} \right) \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}}{m} + \lambda_\rho + \tfrac{1}{2}(\alpha' - 1),$$

where $\lambda_\rho$ is defined in Theorem 9, and where $c' = \dfrac{c}{1-e^{-c}}$ and $\alpha' = \dfrac{2\alpha}{1-e^{-2\alpha}}$.

Proof. In Theorem 9, we replace $R_{P_S}(G_\rho)$ and $\mathrm{dis}_\rho(D_S, D_T)$ by their upper bounds, obtained from Theorem 5 and Theorem 7, with $\delta$ chosen respectively as $\delta/3$ and $2\delta/3$. In the latter case, we use

$$2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{2\delta/3} \;=\; 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta} \;<\; 2\left( \mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta} \right).$$

We now present a result based on the McAllester bound, which allows us to easily deal with different sizes of samples.

Theorem 12. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, any set $H$ of hypotheses, any prior distribution $\pi$ over $H$, any $\delta \in (0, 1]$, with a probability at least $1-\delta$ over the choice of $S \sim (P_S)^m$ (whose unlabeled part is distributed according to $(D_S)^m$) and $T \sim (D_T)^{m'}$, for every $\rho$ over $H$, we have

$$R_{P_T}(G_\rho) \;\le\; R_S(G_\rho) + \tfrac{1}{2}\,\mathrm{dis}_\rho(S, T) + \lambda_\rho + \sqrt{\frac{1}{2m}\left( \mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right)} + \tfrac{1}{2}\sqrt{\frac{1}{2m}\left( 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{8\sqrt{m}}{\delta} \right)} + \tfrac{1}{2}\sqrt{\frac{1}{2m'}\left( 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{8\sqrt{m'}}{\delta} \right)},$$

where $\lambda_\rho$ is defined in Theorem 9.

Proof. We insert Theorems 4 and 8 (with $\delta := \delta/2$) in Theorem 9.

Under the assumption that the domains are somehow related in terms of labeling agreement on $P_S$ and $P_T$ (for every distribution $\rho$ over $H$), i.e., a low $\mathrm{dis}_\rho(D_S, D_T)$ implies a negligible $\lambda_\rho$, a natural solution for a PAC-Bayesian domain adaptation algorithm without target labels is to minimize the bound of Theorem 11 by disregarding $\lambda_\rho$. Notice that a major advantage of our domain adaptation bound is that we can jointly optimize the risk and the divergence with a theoretical justification.

5 PAC-Bayesian Domain Adaptation Learning of Linear Classifiers

In this section, we design a learning algorithm for domain adaptation inspired by the PAC-Bayesian learning algorithm of Germain et al. (2009a). That is, we adopt the specialization of the PAC-Bayesian theory to linear classifiers described in Section 3.3. Note that the code of our algorithm is available on-line.

5.1 Minimizing the PAC-Bayesian Domain Adaptation Bound

Let us consider a prior $\pi_0$ and a posterior $\rho_w$ that are spherical Gaussian distributions over a space of linear classifiers, exactly as defined in Section 3.3. Given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$ and a target sample $T = \{x_i^t\}_{i=1}^m$, we focus on the minimization of the bound given by Theorem 11. We work under the assumption that the term $\lambda_{\rho_w}$ of the bound is negligible. Thus, the posterior distribution $\rho_w$ that minimizes the bound on $R_{P_T}(G_{\rho_w})$ is the same that minimizes

$$C\, m\, R_S(G_{\rho_w}) + A\, m\, \mathrm{dis}_{\rho_w}(S, T) + \mathrm{KL}(\rho_w \| \pi_0). \qquad (13)$$

The values $A > 0$ and $C > 0$ are hyperparameters of the algorithm. Note that the constants $\alpha$ and $c$ of Theorem 11 can be recovered from any $A$ and $C$.

5 Domain Disagreement of Linear Classifiers

We know from Equation 9 how to compute the terms $R_S(G_{\rho_w})$ and $\mathrm{KL}(\rho_w \| \pi_0)$ of Equation 13. Let us now derive the value of $\mathrm{dis}_{\rho_w}(S, T)$, i.e., the empirical domain disagreement between $S$ and $T$ of a distribution $\rho_w$ over linear classifiers.

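First, for any marginal $D$, we obtain

$$\begin{aligned}
R_D(G_{\rho_w}, G_{\rho_w}) &= \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2}\, L_{0\text{-}1}(h(x), h'(x)) \\
&= \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2}\, I[h(x) \neq h'(x)] \\
&= \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2} \left( I[h(x) = 1]\, I[h'(x) = -1] + I[h(x) = -1]\, I[h'(x) = 1] \right) \\
&= 2\, \mathbf{E}_{x\sim D}\, \mathbf{E}_{(h,h')\sim\rho_w^2}\, I[h(x) = 1]\, I[h'(x) = -1] \\
&= 2\, \mathbf{E}_{x\sim D}\, \mathbf{E}_{h\sim\rho_w} I[h(x) = 1]\; \mathbf{E}_{h'\sim\rho_w} I[h'(x) = -1] \\
&= 2\, \mathbf{E}_{x\sim D}\, \Phi\!\left( \frac{w \cdot x}{\|x\|} \right) \Phi\!\left( -\frac{w \cdot x}{\|x\|} \right).
\end{aligned}$$

Thus,

$$\mathrm{dis}_{\rho_w}(S, T) = \left| R_S(G_{\rho_w}, G_{\rho_w}) - R_T(G_{\rho_w}, G_{\rho_w}) \right| = \left| \frac{1}{m}\sum_{i=1}^m \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \frac{1}{m}\sum_{i=1}^m \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right|,$$

where

$$\Phi_{\mathrm{dis}}(a) = 2\, \Phi(a)\, \Phi(-a). \qquad (14)$$

5 Objective Function and Gradient

From the results of Sections 3.3 and 5, we obtain that Equation 13 equals

$$C \sum_{i=1}^m \Phi\!\left( y_i^s \frac{w \cdot x_i^s}{\|x_i^s\|} \right) + A \left| \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right] \right| + \frac{1}{2}\|w\|^2,$$

which is highly non-convex. To make the optimization problem more tractable, we replace the loss function $\Phi$ by its convex relaxation $\Phi_{\mathrm{cvx}}$ (as in Section 3.3.3) and minimize the resulting cost function by gradient descent. Even if this optimization task is still not convex ($\Phi_{\mathrm{dis}}$ is quasiconcave), our empirical study shows no need to perform any restarts to find a suitable solution [9]. We name this domain adaptation algorithm PBDA.

[9] We observe empirically that a good strategy is to first find the vector $w$ minimizing the convex problem of PBGD3 described in Section 3.3.3, and then use this $w$ as a starting point for the gradient descent of PBDA.

To sum up, given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$, a target sample $T = \{x_i^t\}_{i=1}^m$, and hyperparameters $A$ and $C$, the algorithm PBDA performs gradient descent to minimize the following objective function:

$$G(w) = C \sum_{i=1}^m \Phi_{\mathrm{cvx}}\!\left( y_i^s \frac{w \cdot x_i^s}{\|x_i^s\|} \right) + A \left| \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right] \right| + \frac{1}{2}\|w\|^2, \qquad (15)$$

where

$$\Phi(a) = \frac{1}{2}\left[ 1 - \mathrm{Erf}\!\left( \frac{a}{\sqrt{2}} \right) \right], \qquad \Phi_{\mathrm{cvx}}(a) = \max\left\{ \Phi(a),\; \frac{1}{2} - \frac{a}{\sqrt{2\pi}} \right\}, \qquad \Phi_{\mathrm{dis}}(a) = 2\, \Phi(a)\, \Phi(-a),$$

with Erf the Gauss error function defined in Equation 8. Figure 1 illustrates these three functions.

Figure 1: Behavior of functions $\Phi$, $\Phi_{\mathrm{cvx}}$ and $\Phi_{\mathrm{dis}}$.

The gradient $\nabla G(w)$ of Equation 15 is then given by

$$\nabla G(w) = C \sum_{i=1}^m \Phi'_{\mathrm{cvx}}\!\left( y_i^s \frac{w \cdot x_i^s}{\|x_i^s\|} \right) \frac{y_i^s\, x_i^s}{\|x_i^s\|} + w + s \times A \sum_{i=1}^m \left[ \Phi'_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) \frac{x_i^s}{\|x_i^s\|} - \Phi'_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \frac{x_i^t}{\|x_i^t\|} \right],$$

where $\Phi'_{\mathrm{cvx}}(a)$ and $\Phi'_{\mathrm{dis}}(a)$ are respectively the derivatives of functions $\Phi_{\mathrm{cvx}}$ and $\Phi_{\mathrm{dis}}$ evaluated at point $a$, and

$$s = \mathrm{sgn}\left( \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \right] \right).$$

We extend these equations to kernels in the following subsection.

5.3 Using a Kernel Function

The kernel trick allows us to work with a dual weight vector $\alpha \in \mathbb{R}^{2m}$ that is a linear classifier in an augmented space. Given a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we have

$$h_w(x) = \mathrm{sgn}\left[ \sum_{i=1}^m \alpha_i\, k(x_i^s, x) + \sum_{i=1}^m \alpha_{i+m}\, k(x_i^t, x) \right].$$

Let us denote by $K$ the kernel matrix of size $2m \times 2m$ such that $K_{i,j} = k(x_i, x_j)$, where

$$x_\# = \begin{cases} x_\#^s & \text{if } \# \le m, \\ x_{\#-m}^t & \text{otherwise.} \end{cases}$$

In that case, the objective function of Equation 15 is rewritten in terms of the vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{2m})$ as

$$G(\alpha) = C \sum_{i=1}^m \Phi_{\mathrm{cvx}}\!\left( y_i^s \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + A \left| \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \right] \right| + \frac{1}{2} \sum_{i=1}^{2m} \sum_{j=1}^{2m} \alpha_i\, \alpha_j\, K_{i,j}.$$

The gradient of the latter equation is given by the vector $\nabla G(\alpha) = (\alpha'_1, \alpha'_2, \ldots, \alpha'_{2m})$, with

$$\alpha'_\# = C \sum_{i=1}^m \Phi'_{\mathrm{cvx}}\!\left( y_i^s \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{y_i^s\, K_{i,\#}}{\sqrt{K_{i,i}}} + \sum_{i=1}^{2m} \alpha_i\, K_{i,\#} + s \times A \sum_{i=1}^m \left[ \Phi'_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{K_{i,\#}}{\sqrt{K_{i,i}}} - \Phi'_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \frac{K_{i+m,\#}}{\sqrt{K_{i+m,i+m}}} \right],$$

where

$$s = \mathrm{sgn}\left( \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \right] \right).$$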
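The linear objective $G(w)$ of Section 5 and its gradient can be sketched in a few lines of Python. This is a minimal illustration and not the authors' released code. The function names (`phi_cvx`, `pbda_objective`, `fit_pbda`, and so on) are hypothetical, the source and target samples are assumed to have the same size $m$ and no all-zero rows, and the zero starting point is an arbitrary choice (the text suggests starting from the PBGD3 solution instead). The derivative of $\Phi_{\mathrm{dis}}$ uses the identities $\Phi'(a) = -e^{-a^2/2}/\sqrt{2\pi}$ and $\Phi(-a) - \Phi(a) = \mathrm{Erf}(a/\sqrt{2})$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import erf

SQRT_2 = np.sqrt(2.0)
SQRT_2PI = np.sqrt(2.0 * np.pi)


def phi(a):
    # Probit loss: Phi(a) = (1 - Erf(a / sqrt(2))) / 2.
    return 0.5 * (1.0 - erf(a / SQRT_2))


def phi_prime(a):
    # Derivative of Phi: minus the standard normal density.
    return -np.exp(-0.5 * a ** 2) / SQRT_2PI


def phi_cvx(a):
    # Convex relaxation: max(Phi(a), 1/2 - a / sqrt(2 pi)).
    return np.maximum(phi(a), 0.5 - a / SQRT_2PI)


def phi_cvx_prime(a):
    # Phi_cvx is the tangent line at 0 for a < 0 and Phi itself for a >= 0.
    return np.where(a < 0.0, -1.0 / SQRT_2PI, phi_prime(a))


def phi_dis(a):
    # Disagreement loss: Phi_dis(a) = 2 Phi(a) Phi(-a).
    return 2.0 * phi(a) * phi(-a)


def phi_dis_prime(a):
    # Uses Phi'(-a) = Phi'(a) and Phi(-a) - Phi(a) = Erf(a / sqrt(2)).
    return 2.0 * phi_prime(a) * erf(a / SQRT_2)


def normalize_rows(X):
    # Divide each example by its Euclidean norm (rows assumed non-zero).
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def pbda_objective(w, Xs_n, ys, Xt_n, A, C):
    # Returns (G(w), gradient of G at w) for row-normalized inputs.
    a_s = Xs_n @ w
    a_t = Xt_n @ w
    gap = phi_dis(a_s).sum() - phi_dis(a_t).sum()
    value = C * phi_cvx(ys * a_s).sum() + A * abs(gap) + 0.5 * (w @ w)
    grad = C * (Xs_n.T @ (phi_cvx_prime(ys * a_s) * ys))
    grad += np.sign(gap) * A * (Xs_n.T @ phi_dis_prime(a_s) - Xt_n.T @ phi_dis_prime(a_t))
    grad += w
    return value, grad


def fit_pbda(Xs, ys, Xt, A=1.0, C=1.0, w0=None):
    # Minimizes G(w) with BFGS; w0 defaults to the zero vector (an assumption).
    Xs_n, Xt_n = normalize_rows(Xs), normalize_rows(Xt)
    if w0 is None:
        w0 = np.zeros(Xs.shape[1])
    result = minimize(pbda_objective, w0, args=(Xs_n, ys, Xt_n, A, C),
                      jac=True, method="BFGS")
    return result.x
```

Passing `jac=True` tells `scipy.optimize.minimize` that the objective returns both the value and the gradient, so the analytic gradient above is used instead of finite differences. Because of the absolute value, the objective is not differentiable where the two disagreement sums coincide; the sketch returns a zero contribution for that term at such points.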

6 Experiments

6.1 General Setup

PBDA [10] has been evaluated on a toy problem and a sentiment dataset. For our experiments, we minimize the objective function using a Broyden-Fletcher-Goldfarb-Shanno method (BFGS) implemented in the scipy python library. PBDA has been compared with:

- SVM learned only from the source domain, i.e., without adaptation. We made use of the SVM-light library (Joachims, 1999).
- PBGD3, presented in Section 3.3, and learned only from the source domain, i.e., without adaptation.
- DASVM of Bruzzone and Marconcini (2010), an iterative domain adaptation algorithm which tries to maximize iteratively a notion of margin on self-labeled target examples. We implemented DASVM with the LibSVM library (Chang and Lin, 2001).
- CODA of Chen et al. (2011), a co-training domain adaptation algorithm, which looks iteratively for target features related to the training set. We used the implementation provided by the authors. Note that Chen et al. (2011) have shown best results on the dataset considered in our Section 6.4.

Each parameter is selected with a grid search via a classical 5-fold cross-validation (CV) on the source sample for PBGD3 and SVM, and via a 5-fold reverse/circular validation (RCV) on the source and the unlabeled target samples for CODA, DASVM, and PBDA. We describe this latter point in the following section. Note that for PBDA we search on a $20 \times 20$ parameter grid for a parameter $A$ between $0.01$ and $10^6$ and a parameter $C$ between $1.0$ and $10^8$, both on a logarithmic scale.

[10] We made our code available on-line.

6.2 A Note about the Reverse Validation

A crucial question in domain adaptation is the validation of the hyperparameters. One solution is to follow the principle proposed by Zhong et al. (2010), which relies on the use of a reverse validation approach. This approach is based on a so-called reverse classifier evaluated on the source domain. We propose to follow it for tuning the parameters of PBDA, DASVM and CODA. Note that Bruzzone and Marconcini (2010) have proposed a similar method, called circular validation, in the context of DASVM.

Concretely, in our setting, given $k$ folds on the source labeled sample ($S = S_1 \cup \ldots \cup S_k$), $k$ folds on the unlabeled target sample ($T = T_1 \cup \ldots \cup T_k$), and a learning algorithm (parametrized by a fixed tuple of hyperparameters), the reverse cross-validation risk on the $i$-th fold is computed as follows. Firstly, the source set $S \setminus S_i$ is used as a labeled sample and the target set $T \setminus T_i$ is used as an unlabeled sample for learning a classifier $h'$. Secondly, using the same algorithm, a reverse classifier $h^r$ is learned using the self-labeled sample $\{(x, h'(x))\}_{x \in T \setminus T_i}$ as the source set and the unlabeled part of $S \setminus S_i$ as target sample. Finally, the reverse classifier $h^r$ is evaluated on $S_i$. We summarize this principle on Figure 2. The process is repeated $k$ times to obtain the reverse cross-validation risk averaged across all folds.

Figure 2: The principle of the reverse/circular validation in our setting. (Diagram steps: learning $h'$ from labeled $S$ and unlabeled $T$; auto-labeling of $T$ with $h'$; learning the reverse classifier $h^r$ from unlabeled $S$ and auto-labeled $T$; evaluation of $h^r$ on labeled $S$.)

6.3 Toy Problem: Two Inter-Twinning Moons

The source domain considered here is the classical binary problem with two inter-twinning moons, each class corresponding to one moon (Figure 3). We then consider seven different target domains by rotating anticlockwise the source domain according to seven angles (from 10° to 90°). The higher the angle, the more difficult the problem becomes. For each domain, we generate 300 instances (150 of each class). Moreover, to assess the generalization ability of our approach, we evaluate each algorithm on an independent test set of 1,000 target points (not provided to the algorithms). We make use of a Gaussian kernel for all the methods. Each domain adaptation problem is repeated ten times, and we report the average error rates on Table 1. Note that since CODA decomposes features for applying co-training, it is not appropriate here (we have only two features).

Table 1: Average error rate results for seven rotation angles. (Rows: PBGD3 CV, SVM CV, DASVM RCV, PBDA RCV.)

We remark that our PBDA provides the best performances except for 50° and 20°, indicating that PBDA accurately tackles domain adaptation tasks. It shows a nice adaptation ability, especially for the hardest problem, probably due to the fact that $\mathrm{dis}_\rho$ is tighter and seems to be a good regularizer in a domain adaptation situation.

The adaptation versus risk minimization trade-off suggested by Theorem 10 appears in Figure 3. Indeed, the plot illustrates that PBDA accepts a lower source accuracy to maintain its performance on the target domain, at least when the source and the target domains are not so different. Note, however, that for large angles, PBDA prefers to focus on the source accuracy. We claim that this is a reasonable behavior for a domain adaptation algorithm.

6.4 Sentiment Analysis Dataset

We consider the popular Amazon reviews dataset (Blitzer et al., 2006), composed of reviews of four types of Amazon.com products (books, DVDs, electronics, kitchen appliances). Originally, the reviews corresponded to a rate between one and five stars, and the feature space (of unigrams and bigrams) has on average a dimension of 100,000. For the sake of simplicity, and to consider a binary classification task, we propose to follow a setting similar to the one proposed by Chen et al. (2011). Then the two possible classes are: +1 for the products with a rank higher than 3 stars, −1 for those with a rank lower or equal to 3 stars. The dimensionality is reduced in the following way: Chen et al. (2011) only kept the features that appear at least ten times in a particular domain adaptation task (about 40,000 features remain), and pre-processed the data with a standard tf-idf re-weighting. One type of product is a domain, so we perform twelve domain adaptation tasks. For example, books→DVDs corresponds to the task for which books is the source domain and DVDs the target one. The algorithms use a linear kernel and consider 2,000 labeled source examples and 2,000 unlabeled target examples. We evaluate them on separate target test sets proposed by Chen et al. (2011) (between 3,000 and 6,000 examples), and we report the results on Table 2.

We make the following observations. First, as expected, the domain adaptation approaches provide the best average results. Then, PBDA is on average better than CODA, but less accurate than DASVM. However, PBDA is competitive: the results are not significantly different from CODA and DASVM. Moreover, we have observed that PBDA is significantly faster than CODA and DASVM: these two algorithms are based on costly iterative procedures, increasing the running time by at least a factor of five in comparison with PBDA. In fact, the clear advantage of PBDA is that we jointly optimize the terms of our bound in one step.
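The reverse validation procedure of Section 6.2 can be sketched as follows. This is an illustrative outline under stated assumptions, not the protocol code used for the experiments: `fit` is a hypothetical callable that takes a labeled sample and an unlabeled sample and returns a prediction function with outputs in {−1, +1}, and the random fold assignment is one possible choice.

```python
import numpy as np


def reverse_validation_risk(fit, Xs, ys, Xt, k=5, seed=0):
    """Average reverse (circular) validation risk over k folds.

    `fit(X_labeled, y_labeled, X_unlabeled)` is a hypothetical callable that
    trains a domain adaptation classifier and returns a function mapping an
    array of inputs to labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    s_folds = np.array_split(rng.permutation(len(Xs)), k)
    t_folds = np.array_split(rng.permutation(len(Xt)), k)
    risks = []
    for s_held, t_held in zip(s_folds, t_folds):
        s_train = np.setdiff1d(np.arange(len(Xs)), s_held)
        t_train = np.setdiff1d(np.arange(len(Xt)), t_held)
        # Step 1: learn h' from labeled S \ S_i and unlabeled T \ T_i.
        h = fit(Xs[s_train], ys[s_train], Xt[t_train])
        # Step 2: self-label T \ T_i with h', then learn the reverse
        # classifier h^r, with the unlabeled part of S \ S_i as target.
        h_rev = fit(Xt[t_train], h(Xt[t_train]), Xs[s_train])
        # Step 3: evaluate h^r on the held-out labeled source fold S_i.
        risks.append(np.mean(h_rev(Xs[s_held]) != ys[s_held]))
    return float(np.mean(risks))
```

A hyperparameter grid search would call this function once per candidate tuple, wrapping the learning algorithm and its hyperparameters into `fit`, and keep the tuple with the lowest returned risk.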


Figure 3: Illustration of the decision boundary of PBDA on three rotation angles for fixed parameters $A$ and $C$. The two classes of the source sample are green and pink, and the target unlabeled sample is gray. The bottom plot shows corresponding source and target errors. We intentionally avoid tuning PBDA parameters to highlight its inherent adaptation behavior.

6.5 Combining PBDA and Representation Learning

As discussed in the introduction, there exist several families of approaches used to tackle the domain adaptation problem. The present work focuses on the minimization of a distance metric between the source and target distributions. Now, we ask ourselves whether it can be fruitful to combine our PBDA algorithm with another approach. To do so, we executed PBDA on top of the Marginalized Stacked Denoising Autoencoders (mSDA) introduced by Chen et al. (2012).

In brief, mSDA is an unsupervised algorithm that learns a new representation of the training samples. As a denoising autoencoder algorithm, it finds a representation from which one can approximately reconstruct the original features of an example from its noisy counterpart. The originality of mSDA is to learn a representation that allows reconstructing both source and target unlabeled examples. Then, one can execute any supervised learning algorithm on the new representation of source samples, for which the labels are known. That is, given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$ and a target sample $T = \{x_i^t\}_{i=1}^m$, mSDA takes the unlabeled parts of $S$ and $T$, $\{x_1^s, \ldots, x_m^s, x_1^t, \ldots, x_m^t\}$, and learns a feature map $f : X \to X'$, where $X'$ is a new input space of real-valued vectors. In Chen et al. (2012), a linear SVM is executed using $S_f = \{(f(x_i^s), y_i^s)\}_{i=1}^m$ as training data, and the hyperparameter $C$ is selected by standard cross-validation.

We compare the performance of SVM on the mSDA representation to PBDA on the same representation. That is, we obtain a new representation of both source $S_f = \{(f(x_i^s), y_i^s)\}_{i=1}^m$ and target $T_f = \{f(x_i^t)\}_{i=1}^m$ data, using mSDA. Then, we execute PBDA using $S_f$ and $T_f$. This comparison is done using the Amazon reviews dataset. For the sake of comparison, we used the dataset pre-processed by Chen et al. (2012), which is slightly different from the one used in Section 6.4. Indeed, each domain shares the same 5,000 features, and no tf-idf re-weighting is applied. For each source-target pair, mSDA representations are generated using a corruption probability of 50% and 5 layers. Then, SVM and PBDA are executed on the same representations.

The results are reported in Table 3. The PBDA algorithm, when we select the hyperparameters by reverse cross-validation (PBDA RCV), is not always as good as the cross-validated SVM (SVM CV). However, by looking closer at the results, we notice that there often exist hyperparameters for which PBDA is better on the testing set than the best achievable SVM (as reported by the columns PBDA TEST and SVM TEST). This suggests that it might be advantageous to mix mSDA and PBDA learning strategies. However, hyperparameter selection is still a challenge in domain adaptation when we do not have any target labels, even if the reverse cross-validation method is a sound strategy. For exploratory purposes, we report on Table 3 the risk of PBDA while performing the model selection by standard cross-validation (PBDA CV) and
