Domain-Adversarial Neural Networks


Hana Ajakan 1, Pascal Germain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2
2 Département d'informatique et de génie logiciel, Université Laval, Québec, Canada
3 Département d'informatique, Université de Sherbrooke, Québec, Canada
1 hana.ajakan.1@ulaval.ca, 2 firstname.lastname@ift.ulaval.ca, 3 hugo.larochelle@usherbrooke.ca

Abstract

We introduce a new neural network learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification target, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even when these are trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).

1 Introduction

The cost of generating labeled data for a new machine learning task is often an obstacle to applying machine learning methods. There is thus great incentive to develop ways of exploiting data from one problem in order to generalize to another. Domain adaptation focuses on the situation where we have data generated from two different, but somehow similar, distributions. One example arises in sentiment analysis of written reviews, where we might want to distinguish positive reviews from negative ones. While we might have labeled data for reviews of one type of product (e.g., movies), we might want to be able to generalize to reviews of other products (e.g., books). Domain adaptation achieves such transfer by exploiting an extra set of unlabeled training data for the new problem to which we wish to generalize (e.g., unlabeled reviews of books).

One of the main approaches to achieving such transfer is to learn not only a classifier but also the representation of the data, in a way that favours transfer. A large body of work exists on jointly training a classifier and a representation that are both linear [1, 2, 3]. However, recent research has shown that non-linear neural networks can also be successful [4]. Specifically, a variant of the denoising autoencoder [5], known as the marginalized stacked denoising autoencoder (mSDA) [6], has demonstrated state-of-the-art performance on this problem. By learning a representation that is robust to input corruption noise, mSDA obtains a representation that is also more stable across changes of domain and can thus enable cross-domain transfer.

In this paper, we propose to encourage stability of the representation across domains explicitly in the learning algorithm of a neural network. This approach is motivated by theory on domain adaptation [7, 8] suggesting that a good representation for cross-domain transfer is one under which the domain of origin of an input observation cannot be identified.
We show that this principle can be implemented in a neural network learning objective that includes a term in which the network's hidden layer works adversarially against output connections that predict domain membership. The neural network is then simply trained by gradient descent on this objective. The success of this domain-adversarial neural network (DANN) is confirmed by extensive experiments on a sentiment analysis classification benchmark, showing that it achieves better performance than a regular neural network and an SVM. Moreover, by training these models on top of the mSDA representations, our experiments also confirm that explicitly minimizing domain discriminability works better than relying only on representations that are robust to noise.

2 Domain Adaptation

We consider binary classification tasks where X ⊆ R^n is the input space and Y = {0, 1} is the label set. We have two different distributions over X × Y, called the source domain D_S and the target domain D_T. A domain adaptation learning algorithm is provided with a labeled source sample S = {(x_i^s, y_i^s)}_{i=1}^m drawn i.i.d. from D_S, and an unlabeled target sample T = {x_i^t}_{i=1}^m drawn i.i.d. from D_T. The goal of the learning algorithm is to build a classifier η : X → Y with a low target risk

R_{D_T}(\eta) \overset{\text{def}}{=} \Pr_{(x^t, y^t) \sim D_T}\big[\eta(x^t) \neq y^t\big],

while having no information about the labels of D_T.

To tackle this challenging task, many domain adaptation approaches bound the target error by the sum of the source error and a notion of distance between the source and the target distributions. These methods are intuitively justified by a simple assumption: the source risk is expected to be a good indicator of the target risk when both distributions are similar. Several notions of distance have been proposed for domain adaptation [1, 2, 7, 8, 9, 10]. In this paper, we focus on the H-divergence used by Ben-David et al. [7, 8] (and based on the earlier work of Kifer et al. [11]), defined below.

Definition 1 ([7, 8, 11]). Given two domain distributions D_S and D_T over X, and a hypothesis class H, the H-divergence between D_S and D_T is

d_H(D_S, D_T) \overset{\text{def}}{=} 2 \sup_{\eta \in H} \Big| \Pr_{x^s \sim D_S}\big[\eta(x^s) = 1\big] - \Pr_{x^t \sim D_T}\big[\eta(x^t) = 1\big] \Big|.

That is, the H-divergence relies on the capacity of the hypothesis class H to distinguish examples generated by D_S from examples generated by D_T. Ben-David et al. [7, 8] proved that, for a symmetric hypothesis class H, one can compute the empirical H-divergence between two samples S ∼ (D_S)^m and T ∼ (D_T)^m by computing

\hat{d}_H(S, T) \overset{\text{def}}{=} 2\left(1 - \min_{\eta \in H}\left[\frac{1}{m}\sum_{i=1}^{m} I\big[\eta(x_i^s) = 1\big] + \frac{1}{m}\sum_{i=1}^{m} I\big[\eta(x_i^t) = 0\big]\right]\right), \qquad (1)

where I[a] is the indicator function, which is 1 if predicate a is true and 0 otherwise.

Ben-David et al. [7, 8] suggested that, even if it is generally hard to compute \hat{d}_H(S, T) exactly (e.g., when H is the space of linear classifiers on X), we can easily approximate it by running a learning algorithm on the problem of discriminating between source and target examples. To do so, we construct a new dataset U = {(x_i^s, 1)}_{i=1}^m ∪ {(x_i^t, 0)}_{i=1}^m, where the examples of the source sample are labeled 1 and the examples of the target sample are labeled 0. Then, the risk of the classifier trained on the dataset U approximates the "min" part of Eq. (1).
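As an illustration of this proxy estimate, the sketch below (not from the paper; the function name proxy_h_divergence and the use of scikit-learn's LogisticRegression are assumptions) trains a linear domain classifier on U and converts its held-out error ε into the commonly used quantity 2(1 − 2ε), which is large when the domains are easily separable and close to zero when they are indistinguishable.

```python
# Illustrative sketch (not from the paper): approximate the empirical H-divergence
# by training a linear classifier to discriminate source from target examples,
# following the construction of the dataset U described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_h_divergence(Xs, Xt, seed=0):
    """Xs, Xt: arrays of shape (m, n) holding source / target inputs
    (or their hidden representations h(x))."""
    X = np.vstack([Xs, Xt])
    z = np.concatenate([np.ones(len(Xs)), np.zeros(len(Xt))])  # source = 1, target = 0
    X_tr, X_te, z_tr, z_te = train_test_split(X, z, test_size=0.5,
                                              random_state=seed, stratify=z)
    domain_clf = LogisticRegression(max_iter=1000).fit(X_tr, z_tr)
    err = np.mean(domain_clf.predict(X_te) != z_te)  # risk of the domain classifier on U
    return 2.0 * (1.0 - 2.0 * err)

# Example: two Gaussian "domains" with shifted means yield a large estimate.
rng = np.random.RandomState(0)
Xs, Xt = rng.randn(500, 20), rng.randn(500, 20) + 1.0
print(proxy_h_divergence(Xs, Xt))
```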
Ben-David et al. [7, 8] also showed that d_H(D_S, D_T) is upper bounded by its empirical estimate \hat{d}_H(S, T) plus a constant complexity term that depends on the VC dimension of H and on the sizes of the samples S and T. By combining this result with a similar bound on the source risk, the following theorem is obtained.

Theorem 2 (Ben-David et al. [7, 8]). Let H be a hypothesis class of VC dimension d. With probability 1 − δ over the choice of samples S ∼ (D_S)^m and T ∼ (D_T)^m, for every η ∈ H:

R_{D_T}(\eta) \;\leq\; R_S(\eta) + \sqrt{\frac{4}{m}\left(d \log\frac{2em}{d} + \log\frac{4}{\delta}\right)} + \hat{d}_H(S, T) + 4\sqrt{\frac{2d\log(2m) + \log\frac{4}{\delta}}{m}} + \beta,

with β ≥ inf_{η* ∈ H} [R_{D_S}(η*) + R_{D_T}(η*)], and where R_S(η) = \frac{1}{m}\sum_{i=1}^{m} I[\eta(x_i^s) \neq y_i^s] is the empirical source risk.

For simplicity, we assume throughout this paper that the source sample S and the target sample T are of equal size m. It is easy to generalize the results to the case |S| ≠ |T|.
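To get a sense of the magnitude of the two complexity terms in Theorem 2, the short sketch below evaluates them for a few illustrative choices of VC dimension d, sample size m and confidence δ; these numbers are assumptions for illustration only, not values used in the paper.

```python
# Illustrative only: plug example values into the two complexity terms of Theorem 2.
# The choices of d, m and delta below are assumptions, not values from the paper.
import math

def complexity_terms(d, m, delta=0.05):
    t1 = math.sqrt(4.0 / m * (d * math.log(2 * math.e * m / d) + math.log(4.0 / delta)))
    t2 = 4.0 * math.sqrt((2.0 * d * math.log(2.0 * m) + math.log(4.0 / delta)) / m)
    return t1, t2

for d, m in [(10, 2000), (100, 2000), (100, 100000)]:
    t1, t2 = complexity_terms(d, m)
    print(f"d={d:>4}, m={m:>6}: first term = {t1:.3f}, second term = {t2:.3f}")
```

Both terms shrink as m grows relative to d, so for large samples the bound is dominated by the source risk, the empirical H-divergence and β.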

The previous result tells us that R_{D_T}(η) can be low only when the β term is low, i.e., only when there exists a classifier that achieves a low risk on both distributions. It also tells us that, to find a classifier with a small R_{D_T}(η) in a given class of fixed VC dimension, the learning algorithm should minimize (within that class) a trade-off between the source risk R_S(η) and the H-divergence \hat{d}_H(S, T). As pointed out by Ben-David et al. [7], a strategy to control the H-divergence is to find a representation of the examples under which the source and the target domains are as indistinguishable as possible. Under such a representation, a hypothesis with a low source risk will, according to Theorem 2, perform well on the target data. We now present a learning algorithm based on this idea.

3 A Domain-Adversarial Neural Network (DANN)

The originality of our approach is to explicitly implement the idea exhibited by Theorem 2 in a neural network classifier (note that the HMM representation learning method for domain adaptation of Huang and Yates (2012) [12] is also inspired by the H-divergence of Ben-David et al.). That is, to learn a model that generalizes well from one domain to another, we ensure that the internal representation of the neural network contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples.

Let us consider the following standard neural network architecture with one hidden layer:

h(x) = \mathrm{sigm}(b + Wx), \qquad f(x) = \mathrm{softmax}(c + V h(x)), \qquad (2)

with \mathrm{sigm}(a) \overset{\text{def}}{=} \left[\frac{1}{1+\exp(-a_i)}\right]_i and \mathrm{softmax}(a) \overset{\text{def}}{=} \left[\frac{\exp(a_i)}{\sum_{j} \exp(a_j)}\right]_i.

Given a training source sample S = {(x_i^s, y_i^s)}_{i=1}^m drawn from D_S, the natural classification loss to use is the negative log-probability of the correct label. This leads to the following learning problem:

\min_{W, V, b, c} \; \frac{1}{m}\sum_{i=1}^{m} -\log f_{y_i^s}(x_i^s), \qquad (3)

where f_y(x) denotes the conditional probability that the neural network assigns x to class y.

Given W and b obtained by solving Eq. (3), we view the output of the hidden layer h(·) of Eq. (2) as the internal representation of the neural network. We denote the source sample representations by h(S) = {h(x_i^s)}_{i=1}^m. Now, consider an unlabeled sample T = {x_i^t}_{i=1}^m drawn from D_T and the corresponding representations h(T) = {h(x_i^t)}_{i=1}^m. Based on Eq. (1), the empirical H-divergence of a symmetric hypothesis class H between the samples h(S) and h(T) is given by

\hat{d}_H(h(S), h(T)) = 2\left(1 - \min_{\eta \in H}\left[\frac{1}{m}\sum_{i=1}^{m} I\big[\eta(h(x_i^s)) = 1\big] + \frac{1}{m}\sum_{i=1}^{m} I\big[\eta(h(x_i^t)) = 0\big]\right]\right). \qquad (4)

Let us consider H as the class of hyperplanes in the representation space. We suggest estimating the "min" part of Eq. (4) with a logistic regressor that models the probability that a given input (either x^s or x^t) comes from the source domain D_S (denoted z = 1) rather than from the target domain D_T (denoted z = 0):

p(z = 1 \mid \phi) = o(\phi) \overset{\text{def}}{=} \mathrm{sigm}(d + w^\top \phi),

where φ is either an output h(x^s) or h(x^t). This enables us to add a domain adaptation term to the objective of Eq. (3), giving the following problem to solve:

\min_{W, V, b, c} \left[\frac{1}{m}\sum_{i=1}^{m} -\log f_{y_i^s}(x_i^s) \;+\; \lambda \max_{w, d}\left(\frac{1}{m}\sum_{i=1}^{m} \log o(h(x_i^s)) + \frac{1}{m}\sum_{i=1}^{m} \log\big(1 - o(h(x_i^t))\big)\right)\right], \qquad (5)

where the parameter λ > 0 weights the domain adaptation regularization term. This optimization problem is motivated by Theorem 2, as it implements a trade-off between the minimization of the source risk R_S(·) and of the divergence \hat{d}_H(·, ·). The parameter λ tunes this trade-off during the learning process.
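To make the objective concrete, the sketch below (a minimal illustration, not the authors' code; the function dann_objective, the toy dimensions, and the random data are assumptions) evaluates the forward pass of Eq. (2), the domain regressor o(·), and the value of the bracketed expression in Eq. (5) for fixed parameters, i.e., the quantity over which the max with respect to (w, d) and the min with respect to (W, V, b, c) are taken.

```python
# Minimal illustration (not the authors' code): forward pass of Eq. (2) and the
# value of the DANN objective of Eq. (5) for fixed parameters.
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dann_objective(Xs, ys, Xt, W, V, b, c, w, d, lam):
    """Source classification loss plus lambda times the domain log-likelihood term."""
    Hs = sigm(Xs @ W.T + b)                     # hidden representations h(x^s), Eq. (2)
    Ht = sigm(Xt @ W.T + b)                     # hidden representations h(x^t)
    F = softmax(Hs @ V.T + c)                   # class probabilities f(x^s), Eq. (2)
    class_loss = -np.mean(np.log(F[np.arange(len(ys)), ys]))
    o_s, o_t = sigm(Hs @ w + d), sigm(Ht @ w + d)   # domain regressor o(.)
    domain_loglik = np.mean(np.log(o_s)) + np.mean(np.log(1.0 - o_t))
    return class_loss + lam * domain_loglik

# Toy example with random data and parameters (assumptions for illustration).
rng = np.random.RandomState(0)
n, l, m = 20, 10, 100                           # input size, hidden size, sample size
Xs, Xt = rng.randn(m, n), rng.randn(m, n) + 0.5
ys = rng.randint(0, 2, size=m)
W, V, b, c = rng.randn(l, n), rng.randn(2, l), np.zeros(l), np.zeros(2)
w, d, lam = rng.randn(l), 0.0, 0.1
print(dann_objective(Xs, ys, Xt, W, V, b, c, w, d, lam))
```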
We see that Eq. (5) involves a maximization operation. Hence, the neural network (parametrized by {W, V, b, c}) and the domain classifier (parametrized by {w, d}) compete against each other, in an adversarial way, over that term. In other words, the hidden layer h(·) maps an example (either source or target) to a representation in which the output layer f(·) accurately classifies the source sample, while the adaptation component o(·) is unable to detect whether an example belongs to the source sample or to the target sample. To optimize Eq. (5), we perform stochastic gradient descent, as detailed in Appendix A.
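The sketch below (an illustration under the same notation, not the authors' implementation; the dimensions and the function dann_update are assumptions) performs one stochastic update in the spirit of Algorithm 1 of Appendix A: the network parameters {W, V, b, c} take a gradient descent step on the full objective, while the domain classifier parameters {w, d} take a gradient ascent step on the domain log-likelihood term.

```python
# Illustration (not the authors' implementation) of one stochastic DANN update:
# gradient descent on {W, V, b, c}, gradient ascent on the domain classifier {w, d}.
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def dann_update(xs, ys, xt, params, lam=0.1, alpha=1e-3):
    W, V, b, c, w, d = params
    # Forward propagation on the source example
    hs = sigm(b + W @ xs)
    f = softmax(c + V @ hs)
    # Backpropagation of the classification loss -log f_{ys}(xs)
    e_y = np.zeros_like(f); e_y[ys] = 1.0       # one-hot encoding of the source label
    g_c = -(e_y - f)
    g_V = np.outer(g_c, hs)
    g_b = (V.T @ g_c) * hs * (1.0 - hs)
    g_W = np.outer(g_b, xs)
    # Domain adaptation regularizer, current (source) example
    o_s = sigm(d + w @ hs)
    g_d = lam * (1.0 - o_s)
    g_w = lam * (1.0 - o_s) * hs
    tmp = lam * (1.0 - o_s) * w * hs * (1.0 - hs)
    g_b += tmp
    g_W += np.outer(tmp, xs)
    # Domain adaptation regularizer, other (target) example
    ht = sigm(b + W @ xt)
    o_t = sigm(d + w @ ht)
    g_d -= lam * o_t
    g_w -= lam * o_t * ht
    tmp = -lam * o_t * w * ht * (1.0 - ht)
    g_b += tmp
    g_W += np.outer(tmp, xt)
    # Descent for the network, ascent for the domain classifier
    W -= alpha * g_W; V -= alpha * g_V; b -= alpha * g_b; c -= alpha * g_c
    w += alpha * g_w; d += alpha * g_d
    return W, V, b, c, w, d

# Toy usage: one update on a random source/target pair of examples.
rng = np.random.RandomState(0)
n, l = 20, 10
params = (rng.randn(l, n), rng.randn(2, l), np.zeros(l), np.zeros(2), rng.randn(l), 0.0)
params = dann_update(rng.randn(n), 1, rng.randn(n), params)
```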

Table 1: Error rates on the Amazon reviews dataset (left) and pairwise Poisson binomial test (right). Rows correspond to the twelve domain adaptation tasks (books→dvd, books→electronics, books→kitchen, dvd→books, dvd→electronics, dvd→kitchen, electronics→books, electronics→dvd, electronics→kitchen, kitchen→books, kitchen→dvd, kitchen→electronics); columns report DANN, NN and SVM, both on the original data and on the mSDA representations. [Numerical entries omitted.]

4 Experiments

In this section, we compare the performance of our proposed DANN algorithm to that of a standard neural network with one hidden layer (NN), described by Eq. (3), and of a Support Vector Machine (SVM) with a linear kernel. To select the hyper-parameters of each of these algorithms, we train them over a parameter grid and use a very small validation set consisting of 100 labeled examples from the target domain. We then select the classifiers having the lowest target validation risk. The training procedure of each algorithm is detailed in Appendix B.

Sentiment analysis dataset. We compare our algorithms on the Amazon reviews dataset, as preprocessed by Chen et al. (2012) [6]. This dataset includes four domains, each composed of reviews of a specific kind of product (books, dvd disks, electronics, and kitchen appliances). Reviews are encoded as 5,000-dimensional feature vectors of unigrams and bigrams, and labels are binary: 0 if the product is rated up to 3 stars, and 1 if the product is rated 4 or 5 stars. We perform twelve domain adaptation tasks. For example, "books→dvd" corresponds to the task for which books is the source domain and dvd disks the target one. All learning algorithms are given 2,000 labeled source examples and 2,000 unlabeled target examples. We then evaluate them on separate target test sets (containing between 3,000 and 6,000 examples). The "Original data" part of Table 1 (left) shows the test risk of all algorithms, and Table 1 (right) reports the probability that one algorithm is significantly better than another according to the Poisson binomial test [13]. We note that DANN performs significantly better than NN (with probability 0.90) and than SVM. As the only difference between DANN and NN is the domain adaptation regularizer, we conclude that our approach successfully helps to find a representation suitable for the target domain.

Combining DANN with autoencoders. We now investigate whether our DANN algorithm can improve on the representation learned by the state-of-the-art marginalized Stacked Denoising Autoencoders (mSDA) [6]. In brief, mSDA is an unsupervised algorithm that learns a new, robust feature representation of the training samples. It takes the unlabeled parts of the union of the source set S and the target set T to learn a feature map from the input space X to a new representation space. As a denoising autoencoder algorithm, it finds a feature representation from which one can (approximately) reconstruct the original features of an example from its noisy counterpart. Chen et al. (2012) [6] showed that training a linear SVM on the new mSDA representation of the source sample performs well on the Amazon reviews dataset. As an alternative to the SVM, we propose to apply our DANN algorithm on the same representations generated by mSDA (using the representations of both the source and the target samples). Note that, even if mSDA and DANN are both representation learning approaches, they pursue different strategies that can be complementary.
In this experiment, we generate the mSDA representations using a corruption probability of 50% and 5 layers, for each source-target pair of domains, using the same Amazon reviews data described earlier. We then execute the three learning algorithms (DANN, NN, and SVM) on top of these representations. The "mSDA representations" part of Table 1 (left and right) confirms that combining mSDA and DANN is a sound approach. Indeed, the Poisson binomial test shows that DANN performs better than NN and SVM, with probabilities 0.82 and 0.88 respectively.

References

[1] L. Bruzzone and M. Marconcini. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), 2010.
[2] P. Germain, A. Habrard, F. Laviolette, and E. Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, 2013.
[3] C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
[4] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, volume 27, pages 97–110, 2011.
[5] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[6] M. Chen, Z. E. Xu, K. Q. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. In ICML, 2012.
[7] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137–144, 2006.
[8] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[9] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, pages 19–30, 2009.
[10] Y. Mansour, M. Mohri, and A. Rostamizadeh. Multiple source adaptation and the Rényi divergence. In UAI, 2009.
[11] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Very Large Data Bases, 2004.
[12] F. Huang and A. Yates. Biased representation learning for domain adaptation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.
[13] A. Lacoste, F. Laviolette, and M. Marchand. Bayesian comparison of machine learning algorithms on single and multiple datasets. In AISTATS, 2012.

A Learning Algorithm

Algorithm 1 DANN stochastic training update

Input: source sample S = {(x_i^s, y_i^s)}_{i=1}^m, target sample T = {x_i^t}_{i=1}^m, hidden layer size l, adaptation parameter λ, learning rate α.
Output: neural network {W, V, b, c}

W, V ← random_init(l)
b, c, w, d ← 0
while stopping criterion is not met do
    for i from 1 to m do
        # Forward propagation
        h(x_i^s) ← sigm(b + W x_i^s)
        f(x_i^s) ← softmax(c + V h(x_i^s))
        # Backpropagation
        Δ_c ← −(e(y_i^s) − f(x_i^s))        # e(y) is a one-hot vector, that is, all 0s but with a 1 at position y
        Δ_V ← Δ_c h(x_i^s)^T
        Δ_b ← (V^T Δ_c) ⊙ h(x_i^s) ⊙ (1 − h(x_i^s))        # ⊙ is the element-wise product
        Δ_W ← Δ_b (x_i^s)^T
        # Add domain adaptation regularizer...
        # ... from current domain
        o(x_i^s) ← sigm(d + w^T h(x_i^s))
        Δ_d ← λ (1 − o(x_i^s))
        Δ_w ← λ (1 − o(x_i^s)) h(x_i^s)
        tmp ← λ (1 − o(x_i^s)) w ⊙ h(x_i^s) ⊙ (1 − h(x_i^s))
        Δ_b ← Δ_b + tmp
        Δ_W ← Δ_W + tmp (x_i^s)^T
        # ... from other domain
        j ← uniform_integer(1, ..., m)
        h(x_j^t) ← sigm(b + W x_j^t)
        o(x_j^t) ← sigm(d + w^T h(x_j^t))
        Δ_d ← Δ_d − λ o(x_j^t)
        Δ_w ← Δ_w − λ o(x_j^t) h(x_j^t)
        tmp ← −λ o(x_j^t) w ⊙ h(x_j^t) ⊙ (1 − h(x_j^t))
        Δ_b ← Δ_b + tmp
        Δ_W ← Δ_W + tmp (x_j^t)^T
        # Update neural network parameters
        W ← W − α Δ_W        # α is the learning rate hyper-parameter
        V ← V − α Δ_V
        b ← b − α Δ_b
        c ← c − α Δ_c
        # Update domain classifier parameters
        w ← w + α Δ_w        # notice the + instead of the −
        d ← d + α Δ_d
    end for
end while
return {W, V, b, c}

B Empirical Experiments Details

Here are some details about the procedure we used to train each learning algorithm.

DANN. The adaptation parameter λ is chosen among 9 values between 10^-2 and 1 on a logarithmic scale. The hidden layer size l is either 1, 5, 12, 25, 50, 75, 100, 150, or 200. Finally, the learning rate α is fixed at 10^-3 when learning on the original data and at 10^-4 when learning on the mSDA representations. For each learning task, we split the source training set to use 90% as the training set S and the remaining 10% as a validation set S_V. We stop the learning process when the risk on S_V is minimal.

NN. We use exactly the same hyper-parameters and training procedure as for DANN above, except that we do not need an adaptation parameter. Note that one can train NN using the DANN implementation (Algorithm 1) with λ = 0.

SVM. The hyper-parameter C of the SVM is chosen among 10 values between 10^-5 and 1 on a logarithmic scale. Note that this range of values is the same as the one used by Chen et al. (2012) [6] in their experiments.
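For convenience, the hyper-parameter grids described in this appendix can be written down as a small configuration; the sketch below (the dictionary layout and variable names are illustrative assumptions, not part of the paper) enumerates those values.

```python
# Illustrative configuration of the hyper-parameter grids described above
# (the dictionary layout is an assumption, not part of the paper).
import numpy as np

grids = {
    "DANN": {
        "lambda": np.logspace(-2, 0, 9),           # 9 values between 1e-2 and 1
        "hidden_size": [1, 5, 12, 25, 50, 75, 100, 150, 200],
        "learning_rate": {"original": 1e-3, "msda": 1e-4},
    },
    "NN": {                                         # same as DANN, without lambda
        "hidden_size": [1, 5, 12, 25, 50, 75, 100, 150, 200],
        "learning_rate": {"original": 1e-3, "msda": 1e-4},
    },
    "SVM": {
        "C": np.logspace(-5, 0, 10),                # 10 values between 1e-5 and 1
    },
}

for algo, grid in grids.items():
    print(algo, {k: (v.tolist() if hasattr(v, "tolist") else v) for k, v in grid.items()})
```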
