Two transfer learning approaches for domain adaptation


1 Two transfer learning approaches for domain adaptation
Amaury Habrard
Laboratoire Hubert Curien, UMR CNRS 5516
University Jean Monnet - Saint-Étienne (France)
Seminar, University of Alicante, 15/05/2014

2 Introduction and Motivation
When do we need Domain Adaptation (DA)? When the learning distribution is different from the testing distribution.
An example of a DA task: we have labeled images from a Web image corpus; is there a person in unlabeled images from a video corpus?
How can we learn, from one distribution, a low-error classifier on another distribution?

3 Outline
1 Domain Adaptation - Introduction
2 What the theory says
3 Unsupervised Visual Domain Adaptation Using Subspace Alignment - ICCV 2013 (joint work with B. Fernando and T. Tuytelaars (K.U. Leuven), and M. Sebban)
4 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - ICML 2013 (joint work with P. Germain and F. Laviolette (U. Laval, Canada), and E. Morvant (LIF, Marseille, now post-doc at IST Austria))
5 Conclusion

4 Outline
1 Domain Adaptation - Introduction
2 What the theory says
3 Unsupervised Visual Domain Adaptation Using Subspace Alignment - ICCV 2013 (joint work with B. Fernando and T. Tuytelaars (K.U. Leuven), and M. Sebban)
4 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - ICML 2013 (joint work with P. Germain and F. Laviolette (U. Laval, Canada), and E. Morvant (LIF, Marseille, now post-doc at IST Austria))
5 Conclusion

5 Traditional Machine Learning vs. Transfer Learning
Intuition and motivation from a computer vision perspective (I'm not an expert).
(Figures adapted from Pan: in traditional machine learning, training and test data are from the same domain; in transfer learning, training and test data are from different domains.)
Can we train classifiers with Flickr photos, as they have already been collected and annotated, and hope the classifiers still work well on mobile camera images? No [Gong et al., CVPR 2012].
Object classifiers optimized on a benchmark dataset often exhibit significant degradation in recognition accuracy when evaluated on another one [Gong et al., ICML 2013; Torralba et al., CVPR 2011; Perronnin et al., CVPR 2010].
Important topic in many areas: tutorials at ICML 2010, CVPR 2012, Interspeech 2012; workshops at ICCV 2013, NIPS. Sessions dedicated to transfer/domain adaptation in many top conferences. Hot topic.

6 Hard to predict what will change in the new domain
Dilemma: it is hard to predict what will change in the new domain.
(Figure: example domain shifts - high quality vs. low quality, daylight vs. sunset, posed photos, art, surveillance images.)

7 Solution: work with data representations in feature space!
(Figure: images from different domains, e.g. a digital SLR vs. a webcam, come with different dimensions and representations.)

8 Problems in NLP - sentiment analysis
Example (figure labels translated from French): a learning algorithm is trained on reviews of one product type (e.g. movie reviews labeled +1/-1) and outputs a classifier; this classifier must then predict the sentiment of reviews of another type (e.g. book reviews), whose labels are unknown.

9 Task and notations
Binary classification task: $X$ input space; $Y = \{-1, +1\}$ output space.
Supervised classification:
- $P_S$ source domain: distribution over $X \times Y$; $D_S$ marginal distribution over $X$
- $S = \{(x_i^s, y_i^s)\}_{i=1}^{m_s} \sim (P_S)^{m_s}$ a labeled source sample
- Objective: find a classifier $h \in H$ with a low source error $R_{P_S}(h) = \mathbb{E}_{(x^s, y^s) \sim P_S}\, \mathrm{I}\left[ h(x^s) \neq y^s \right]$
(Figure: supervised classification learns a model from a labeled sample drawn from $P_S$.)

10 Task and notations
Binary classification task: $X$ input space; $Y = \{-1, +1\}$ output space.
Supervised classification:
- $P_S$ source domain: distribution over $X \times Y$; $D_S$ marginal distribution over $X$
- $S = \{(x_i^s, y_i^s)\}_{i=1}^{m_s} \sim (P_S)^{m_s}$ a labeled source sample
- Objective: find a classifier $h \in H$ with a low source error $R_{P_S}(h) = \mathbb{E}_{(x^s, y^s) \sim P_S}\, \mathrm{I}\left[ h(x^s) \neq y^s \right]$
Domain adaptation:
- $P_T$ target domain: distribution over $X \times Y$; $D_T$ marginal distribution over $X$
- $T = \{x_j^t\}_{j=1}^{m_t} \sim (D_T)^{m_t}$ an unlabeled target sample
- Objective: find a classifier $h \in H$ with a low target error $R_{P_T}(h) = \mathbb{E}_{(x^t, y^t) \sim P_T}\, \mathrm{I}\left[ h(x^t) \neq y^t \right]$
(Figure: the learning model uses the labeled sample from $P_S$ and the unlabeled sample from $D_T$, whose distribution differs from $P_S$.)

11 Outline
1 Domain Adaptation - Introduction
2 What the theory says
3 Unsupervised Visual Domain Adaptation Using Subspace Alignment - ICCV 2013 (joint work with B. Fernando and T. Tuytelaars (K.U. Leuven), and M. Sebban)
4 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - ICML 2013 (joint work with P. Germain and F. Laviolette (U. Laval, Canada), and E. Morvant (LIF, Marseille, now post-doc at IST Austria))
5 Conclusion

12 A first insight
$R_{P_T}(h) = \mathbb{E}_{(x^t, y^t) \sim P_T}\, \mathrm{I}\left[ h(x^t) \neq y^t \right]
= \sum_{(x^t, y^t)} P_T(x^t, y^t)\, \frac{P_S(x^t, y^t)}{P_S(x^t, y^t)}\, \mathrm{I}\left[ h(x^t) \neq y^t \right]
= \mathbb{E}_{(x^t, y^t) \sim P_S}\, \frac{P_T(x^t, y^t)}{P_S(x^t, y^t)}\, \mathrm{I}\left[ h(x^t) \neq y^t \right]$
If the tasks are similar, i.e. $P_S(y^t \mid x^t) = P_T(y^t \mid x^t)$ (covariate shift assumption), then
$R_{P_T}(h) = \mathbb{E}_{(x^t, y^t) \sim P_S}\, \frac{D_T(x^t)\, P_T(y^t \mid x^t)}{D_S(x^t)\, P_S(y^t \mid x^t)}\, \mathrm{I}\left[ h(x^t) \neq y^t \right]
= \mathbb{E}_{(x^t, y^t) \sim P_S}\, \frac{D_T(x^t)}{D_S(x^t)}\, \mathrm{I}\left[ h(x^t) \neq y^t \right]$
Idea: learn an estimate of $\frac{D_T(x^t)}{D_S(x^t)}$, then learn a classifier on the reweighted source data, but:
- the tasks are not similar in general;
- this analysis does not take into account the hypothesis space considered.
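As a concrete illustration of this reweighting idea (my sketch, not part of the original slides): the ratio $D_T(x)/D_S(x)$ can be estimated with a logistic-regression domain discriminator, and a source classifier can then be trained with those importance weights. It assumes scikit-learn and hypothetical arrays Xs, ys (labeled source) and Xt (unlabeled target).

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_density_ratio(Xs, Xt):
    # Train a discriminator to separate source (label 0) from target (label 1);
    # by Bayes' rule, p(target|x)/p(source|x) is proportional to D_T(x)/D_S(x).
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    disc = LogisticRegression(max_iter=1000).fit(X, d)
    p = disc.predict_proba(Xs)[:, 1]               # p(target | x) on source points
    return (p / (1.0 - p)) * (len(Xs) / len(Xt))   # correct for sample-size imbalance

def covariate_shift_classifier(Xs, ys, Xt):
    # Importance-weighted source learning under the covariate shift assumption.
    w = estimate_density_ratio(Xs, Xt)
    return LogisticRegression(max_iter=1000).fit(Xs, ys, sample_weight=w)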

13 Domain Adaptation Theory - Necessity of a Domain Divergence
Labeled sample $S = \{(x_i^s, y_i^s)\}_{i=1}^{m_s}$ drawn i.i.d. from $P_S$; unlabeled sample $T = \{x_i^t\}_{i=1}^{m_t}$ drawn i.i.d. from $D_T$.
If $h$ is learned from the source domain, how does it perform on the target domain?
If the domains are close, then a low source error classifier could be a low target error classifier.

14 Domain Adaptation Theory - S. Ben-David et al.'s Result
Theorem [Ben-David et al., MLJ 10, NIPS 06]. Let $H$ be a symmetric hypothesis space. If $D_S$ and $D_T$ are respectively the marginal distributions of source and target instances, then for all $\delta \in (0, 1]$, with probability at least $1-\delta$:
$\forall h \in H, \quad R_{P_T}(h) \le R_{P_S}(h) + \frac{1}{2}\, d_{H \Delta H}(D_S, D_T) + \nu$

15 Domain Adaptation Theory - S. Ben-David et al.'s Result
Theorem [Ben-David et al., MLJ 10, NIPS 06]. Let $H$ be a symmetric hypothesis space. If $D_S$ and $D_T$ are respectively the marginal distributions of source and target instances, then for all $\delta \in (0, 1]$, with probability at least $1-\delta$:
$\forall h \in H, \quad R_{P_T}(h) \le R_{P_S}(h) + \frac{1}{2}\, d_{H \Delta H}(D_S, D_T) + \nu$
$R_{P_S}(h)$: classical expected error on the source domain.
$d_{H \Delta H}(D_S, D_T)$: the $H \Delta H$-divergence,
$\frac{1}{2}\, d_{H \Delta H}(D_S, D_T) = \sup_{(h, h') \in H^2} \left| R_{D_T}(h, h') - R_{D_S}(h, h') \right| = \sup_{(h, h') \in H^2} \left| \mathbb{E}_{x^t \sim D_T}\, \mathrm{I}\left[ h(x^t) \neq h'(x^t) \right] - \mathbb{E}_{x^s \sim D_S}\, \mathrm{I}\left[ h(x^s) \neq h'(x^s) \right] \right|$
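To make the divergence term concrete (my illustration, not from the slides): a common empirical proxy trains a classifier to distinguish source from target samples and converts its cross-validated error into the "proxy A-distance"; a small value suggests the two marginal distributions are hard to tell apart. The sketch assumes scikit-learn and hypothetical arrays Xs, Xt.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def proxy_a_distance(Xs, Xt, folds=5):
    # Proxy A-distance: 2 * (1 - 2 * err), where err is the cross-validated error
    # of a classifier trained to separate source instances from target instances.
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    acc = cross_val_score(LinearSVC(dual=False), X, d, cv=folds).mean()
    err = 1.0 - acc
    return 2.0 * (1.0 - 2.0 * err)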

16 Domain Adaptation Theory - S. Ben-David et al.'s Result
Theorem [Ben-David et al., MLJ 10, NIPS 06]. Let $H$ be a symmetric hypothesis space. If $D_S$ and $D_T$ are respectively the marginal distributions of source and target instances, then for all $\delta \in (0, 1]$, with probability at least $1-\delta$:
$\forall h \in H, \quad R_{P_T}(h) \le R_{P_S}(h) + \frac{1}{2}\, d_{H \Delta H}(D_S, D_T) + \nu$
$\nu = \inf_{h^* \in H} \left[ R_{P_S}(h^*) + R_{P_T}(h^*) \right]$: error of the joint optimal classifier;
or $\nu = R_{P_T}(h_T^*) + R_{P_T}(h_T^*, h_S^*)$, where $h_D^*$ is the best hypothesis on domain $D$ [Mohri et al., 2009].
(Figure: illustration of situations with small $\nu$ and large $\nu$.)

17 Domain Adaptation Theory - S. Ben-David et al.'s Result
Theorem [Ben-David et al., MLJ 10, NIPS 06]. Let $H$ be a symmetric hypothesis space. If $D_S$ and $D_T$ are respectively the marginal distributions of source and target instances, then for all $\delta \in (0, 1]$, with probability at least $1-\delta$:
$\forall h \in H, \quad R_{P_T}(h) \le R_{P_S}(h) + \frac{1}{2}\, d_{H \Delta H}(D_S, D_T) + \nu$
Idea: minimize the bound, e.g. via reweighting methods, a new projection space, or a new feature-based representation.

18 Illustration of the main methods
(Figure: map of the main DA method families - instance weighting (sample bias, covariate shift); feature representation (domain-invariant features, latent features, metric learning, subspace alignment, latent pattern mining); iterative models (EM-based methods, self-training, boosting-based models); supporting theory (PAC-Bayesian theory, statistical learning theory).)

19 Outline
1 Domain Adaptation - Introduction
2 What the theory says
3 Unsupervised Visual Domain Adaptation Using Subspace Alignment - ICCV 2013 (joint work with B. Fernando and T. Tuytelaars (K.U. Leuven), and M. Sebban)
4 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - ICML 2013 (joint work with P. Germain and F. Laviolette (U. Laval, Canada), and E. Morvant (LIF, Marseille, now post-doc at IST Austria))
5 Conclusion

20 Related work - look for intermediate representations
[Gopalan et al., ICCV 11]: project the data into a common subspace on the geodesic path between the source and target subspaces.

21 An algorithmic approach - Align the two subspaces - Principle
Very simple method, totally unsupervised.

22 Algorithm
Algorithm 1: Subspace alignment DA algorithm
Data: source data S, target data T, source labels L_S, subspace dimension d. Result: predicted target labels L_T.
1. $X_S \leftarrow \mathrm{PCA}(S, d)$ (source subspace defined by the first d eigenvectors)
2. $X_T \leftarrow \mathrm{PCA}(T, d)$ (target subspace defined by the first d eigenvectors)
3. $X_a \leftarrow X_S X_S^\top X_T$ (operator for aligning the source subspace to the target one)
4. $S_a = S X_a$ (new source data in the aligned space)
5. $T_T = T X_T$ (new target data in the aligned space)
6. $L_T \leftarrow \mathrm{Classifier}(S_a, T_T, L_S)$
The term $M = X_S^\top X_T$ corresponds to the subspace alignment matrix: $M = \mathrm{argmin}_M \left\| X_S M - X_T \right\|_F^2$.
$X_a = X_S X_S^\top X_T = X_S M$ projects the source data onto the target subspace.
A natural similarity: $\mathrm{Sim}(x_s, x_t) = x_s X_S M X_T^\top x_t^\top = x_s A x_t^\top$.
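A compact reference implementation of this algorithm (my sketch, assuming scikit-learn; Xs, ys, Xt are hypothetical source features, source labels and target features):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def subspace_alignment_predict(Xs, ys, Xt, d=20):
    # Source and target subspaces: the first d PCA eigenvectors (columns of X_S, X_T).
    XS = PCA(n_components=d).fit(Xs).components_.T   # shape (D, d)
    XT = PCA(n_components=d).fit(Xt).components_.T   # shape (D, d)
    M = XS.T @ XT                                    # alignment matrix, argmin_M ||X_S M - X_T||_F
    Xa = XS @ M                                      # aligns the source axes with the target subspace
    Sa = Xs @ Xa                                     # source data in the aligned space
    Tt = Xt @ XT                                     # target data in its own subspace
    clf = KNeighborsClassifier(n_neighbors=1).fit(Sa, ys)
    return clf.predict(Tt)                           # predicted target labels L_T

The 1-NN classifier mirrors the nearest-neighbour evaluation used on the slides; any classifier taking S_a and L_S could be plugged in instead.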

23 Some results
Adaptation on the Office/Caltech-10 datasets (four domains selected to adapt) and on the ImageNet, LabelMe and Caltech-256 datasets (one used as source and one as target).
(Figure 1: classifying ImageNet images using Caltech-256 images as the source domain; the first row shows an ImageNet query image, the second row the nearest-neighbour image selected by our method.)
Comparisons:
- Baseline 1: projection on the source subspace
- Baseline 2: projection on the target subspace
- Two related methods: GFK [Gong et al., CVPR 12] and GFS [Gopalan et al., ICCV 11]
We use $\mathrm{Sim}(y_S, y_T)$ directly to perform a k-nearest-neighbour classification task (A encodes the relative contributions of the different components of the vectors in their original space). On the other hand, since $\mathrm{Sim}(y_S, y_T)$ is not PSD, we cannot make use of it to learn an SVM directly.

24 Result tables
(Tables 2-5: recognition accuracy with unsupervised DA using a NN classifier and an SVM classifier, on the Office/Caltech-10 datasets (domains A, C, D, W) and on the ImageNet (I), LabelMe (L) and Caltech-256 (C) datasets, comparing no adaptation (NA), Baseline 1, Baseline 2, GFS [8], GFK [7] and our method. Table 1: several distribution discrepancy measures (TDAS, HΔH) averaged over the DA problems on the Office dataset. Figure 2: finding a stable solution and a subspace dimensionality using the consistency theorem.)
Compared to the other baselines, our method obtains the highest TDAS value and the lowest HΔH measure; both GFK and our method have lower HΔH values, meaning that these methods are more likely to perform well.
The results for the 12 DA problems in the unsupervised setting using a NN classifier are shown in Table 2: in 9 out of the 12 DA problems our method outperforms the other ones. The results obtained in the semi-supervised DA setting (see supplementary material) confirm this behavior; there our method outperforms the others in 10 DA problems. The results obtained with an SVM classifier in the unsupervised DA case are shown in Table 3, where our method again outperforms the others.

25 Unsupervised Subspace Alignment - Conclusion
- Very simple and intuitive method
- Totally unsupervised
- Theoretical results for dimensionality detection
- Good results on computer vision datasets
- Can be combined with supervised information (future work)
Subspace alignment offers theoretical and practical perspectives.

26 Outline
1 Domain Adaptation - Introduction
2 What the theory says
3 Unsupervised Visual Domain Adaptation Using Subspace Alignment - ICCV 2013 (joint work with B. Fernando and T. Tuytelaars (K.U. Leuven), and M. Sebban)
4 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - ICML 2013 (joint work with P. Germain and F. Laviolette (U. Laval, Canada), and E. Morvant (LIF, Marseille, now post-doc at IST Austria))
5 Conclusion

27 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - PAC-Bayesian Theory (1/3)
Objective: offer generalization guarantees for majority-vote classifiers (Bayesian inference, boosting, ...), especially for the ρ-weighted majority vote
$B_\rho(x) = \mathrm{sign}\left[ \mathbb{E}_{h \sim \rho}\, h(x) \right]$
where ρ is the posterior distribution over $H$, learned from the prior distribution π over $H$, such that $R_{P_S}(B_\rho)$ is as small as possible.
We work with the Gibbs classifier: $R_{P_S}(G_\rho) = \mathbb{E}_{h \sim \rho}\, R_{P_S}(h)$, and we have $R_{P_S}(B_\rho) \le 2\, R_{P_S}(G_\rho)$.
Generalization bound:
$R_{P_S}(G_\rho) \le R_S(G_\rho) + \sqrt{ \frac{1}{2m} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{8 \sqrt{m}}{\delta} \right] }$
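To fix ideas (my illustration, not from the slides), here is a tiny numpy sketch of a ρ-weighted majority vote and of the empirical Gibbs risk over a finite set of hypotheses; hs is a hypothetical list of ±1 classifiers and rho their weights.

import numpy as np

def majority_vote(hs, rho, X):
    # B_rho(x) = sign( sum_h rho(h) * h(x) ) for a finite hypothesis set.
    votes = np.array([h(X) for h in hs])          # shape (n_hyp, n_samples), values in {-1, +1}
    return np.sign(rho @ votes)

def gibbs_risk(hs, rho, X, y):
    # R(G_rho) = E_{h~rho} R(h): rho-weighted average of the individual 0/1 errors.
    errors = np.array([np.mean(h(X) != y) for h in hs])
    return float(rho @ errors)

# Example with two decision stumps on 2-d data (hypothetical):
# hs = [lambda X: np.where(X[:, 0] > 0, 1, -1), lambda X: np.where(X[:, 1] > 0, 1, -1)]
# rho = np.array([0.7, 0.3])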

28 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - A Domain Divergence suitable for PAC-Bayes
Definition. Let $H$ be a hypothesis class. For any marginal distributions $D_S$ and $D_T$ over $X$ and any distribution ρ on $H$, the domain disagreement $\mathrm{dis}_\rho(D_S, D_T)$ between $D_S$ and $D_T$ is
$\mathrm{dis}_\rho(D_S, D_T) = \left| \mathbb{E}_{(h, h') \sim \rho^2} \left[ R_{D_T}(h, h') - R_{D_S}(h, h') \right] \right|$
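For a finite hypothesis set, this disagreement can be estimated directly from unlabeled samples; a minimal numpy sketch (mine, under the same hypothetical hs/rho conventions as above, with Xs and Xt unlabeled source and target samples):

import numpy as np

def domain_disagreement(hs, rho, Xs, Xt):
    # dis_rho(D_S, D_T) ~ | E_{(h,h')~rho^2} [ R_T(h,h') - R_S(h,h') ] |, estimated on samples.
    Vs = np.array([h(Xs) for h in hs])                        # (n_hyp, n_s), values in {-1, +1}
    Vt = np.array([h(Xt) for h in hs])                        # (n_hyp, n_t)
    Ds = np.mean(Vs[:, None, :] != Vs[None, :, :], axis=2)    # matrix of pairwise R_S(h, h')
    Dt = np.mean(Vt[:, None, :] != Vt[None, :, :], axis=2)    # matrix of pairwise R_T(h, h')
    return abs(rho @ (Dt - Ds) @ rho)                         # rho^2-weighted average difference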

29 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - Domain Adaptation Bound for the Gibbs Classifier
Theorem. Let $H$ be a hypothesis class. For every distribution ρ on $H$, we have
$R_{P_T}(G_\rho) \le R_{P_S}(G_\rho) + \mathrm{dis}_\rho(D_S, D_T) + \nu_\rho$
where $\rho_T^* = \mathrm{argmin}_\rho R_{P_T}(G_\rho)$ is the best target posterior and
$\nu_\rho = R_{P_T}(G_{\rho_T^*}) + \mathbb{E}_{h \sim \rho}\, \mathbb{E}_{h' \sim \rho_T^*} \left[ R_{D_T}(h, h') + R_{D_S}(h, h') \right]$
Comparison between $\mathrm{dis}_\rho(D_S, D_T)$ and $\frac{1}{2} d_{H \Delta H}(D_S, D_T)$:
- $\frac{1}{2} d_{H \Delta H}(D_S, D_T)$ is a worst-case divergence
- $\mathrm{dis}_\rho(D_S, D_T)$ is specific to the considered $G_\rho$
- we have $\mathrm{dis}_\rho(D_S, D_T) \le \frac{1}{2} d_{H \Delta H}(D_S, D_T)$
One solution for PAC-Bayesian DA: jointly minimize $R_{P_S}(G_\rho)$ and $\mathrm{dis}_\rho(D_S, D_T)$, with theoretical justification.

30 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - Domain Adaptation Bound for the Gibbs Classifier: Consistency Bound
Theorem (PAC-Bayesian generalization bound, McAllester's style). For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, for any set $H$ of hypotheses, for any prior distribution π over $H$ and any $\delta \in (0, 1]$, with probability at least $1-\delta$ over the choice of $S_1 \sim (P_S)^{m_1}$, $S_2 \sim (D_S)^{m_2}$ and $T \sim (D_T)^{m'}$, for every ρ over $H$ we have
$R_{P_T}(G_\rho) \le R_S(G_\rho) + \mathrm{dis}_\rho(S, T) + \nu_\rho + 3 \sqrt{ \frac{1}{2 m^*} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{8 \sqrt{m^*}}{\delta} \right] }$
where $m^* = \max\{m_1, m_2, m'\}$ and $\nu_\rho = R_{P_T}(G_{\rho_T^*}) + R_{D_T}(G_\rho, G_{\rho_T^*}) + R_{D_S}(G_\rho, G_{\rho_T^*})$.

31 Algorithm: learn ρ - optimization problem
To derive an algorithm from the previous theorem, we use PAC-Bayesian theory specialized to linear classifiers, with ρ and π isotropic Gaussians centered on $w$ and $0$ respectively:
- $R_S(G_{\rho_w}) = \mathbb{E}_{(x^s, y^s) \sim S}\, \Phi\!\left( y^s\, \frac{w \cdot x^s}{\|x^s\|} \right)$, with the sigmoidal (probit) loss $\Phi(a) = \int_a^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$
- $\mathrm{KL}(\rho_w \| \pi_0) = \frac{1}{2} \|w\|^2$: regularizer
- $\mathrm{dis}_{\rho_w}(S, T) = \left| \mathbb{E}_{x^s \sim S}\, \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x^s}{\|x^s\|} \right) - \mathbb{E}_{x^t \sim T}\, \Phi_{\mathrm{dis}}\!\left( \frac{w \cdot x^t}{\|x^t\|} \right) \right|$, with $\Phi_{\mathrm{dis}}(a) = 2\, \Phi(a)\, \Phi(-a)$
The optimization problem is similar to learning a linear classifier:
$\mathrm{argmin}_w\; R_S(G_{\rho_w}) + C\, \mathrm{dis}_{\rho_w}(S, T) + A\, \|w\|^2$
with A and C being parameters to tune.
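A rough sketch of this optimization (mine, under the assumptions above, not the authors' released implementation): the probit loss is written with scipy's erf and the objective R_S + C·dis + A·||w||² is minimized with a generic quasi-Newton routine; Xs, ys, Xt are hypothetical source features, source labels in {-1, +1} and target features.

import numpy as np
from scipy.optimize import minimize
from scipy.special import erf

def probit_loss(a):
    # Phi(a) = Pr_{z ~ N(0,1)}[z >= a] = 0.5 * (1 - erf(a / sqrt(2)))
    return 0.5 * (1.0 - erf(a / np.sqrt(2.0)))

def pbda_objective(w, Xs, ys, Xt, A=0.01, C=1.0):
    src_margin = (Xs @ w) / (np.linalg.norm(Xs, axis=1) + 1e-12)   # w.x / ||x|| on source
    tgt_margin = (Xt @ w) / (np.linalg.norm(Xt, axis=1) + 1e-12)   # w.x / ||x|| on target
    risk = np.mean(probit_loss(ys * src_margin))                   # R_S(G_rho_w)
    phi_dis = lambda a: 2.0 * probit_loss(a) * probit_loss(-a)
    dis = abs(np.mean(phi_dis(src_margin)) - np.mean(phi_dis(tgt_margin)))  # dis_rho_w(S, T)
    return risk + C * dis + A * np.dot(w, w)                       # full objective

def learn_pbda(Xs, ys, Xt, A=0.01, C=1.0):
    w0 = np.zeros(Xs.shape[1])
    res = minimize(pbda_objective, w0, args=(Xs, ys, Xt, A, C), method="L-BFGS-B")
    return res.x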

32 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - Experiments: Setup
Comparison with SVM, DASVM [Bruzzone et al., PAMI 10] and CODA [Chen et al., NIPS 12] (PBDA is about 5 times faster than DASVM and CODA).
1. Toy problem: inter-twinning moons
- source domain and 7 different target domains obtained by 7 rotation angles
- 10 draws for each angle
- performance measured on a test set of 1,500 target instances
- Gaussian kernel
2. Sentiment analysis: Amazon reviews dataset (text reviews on Amazon products)
- 4 types of products: Books, DVDs, Electronics, Kitchen
- data dimension: 40,
- DA tasks: adaptation from one type to another (e.g. books to kitchen)
- source domain: 2,000 labeled examples; target domain: 2,000 unlabeled examples
- performance measured on a target test set of between 3,000 and 6,000 examples
- linear kernel
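To reproduce the spirit of the toy setup (my sketch, assuming scikit-learn; the exact data generation used in the talk may differ): the source domain is the two-moons dataset and each target domain is the same distribution rotated by a given angle.

import numpy as np
from sklearn.datasets import make_moons

def rotated_moons(n=300, angle_deg=30.0, noise=0.05, seed=0):
    # Source: standard inter-twinning moons; target: the same distribution rotated by angle_deg.
    Xs, ys = make_moons(n_samples=n, noise=noise, random_state=seed)
    theta = np.deg2rad(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Xt_raw, yt = make_moons(n_samples=n, noise=noise, random_state=seed + 1)
    Xt = Xt_raw @ R.T                     # rotate the target points
    return Xs, ys, Xt, yt                 # yt kept only for evaluation, not for training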

33 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - Experiments: Inter-twinning moons (2/2)
(Figure: results on the inter-twinning moons problem for target rotation angles of (a) 10, (b) 20, (c) 30, (d) 40, (e) 50 and (f) 70 degrees.)

34 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - Experiments: Sentiment Analysis
(Table: accuracies of PBGD, SVM, DASVM, CODA and PBDA on the 12 Amazon DA tasks, each of Books, DVDs, Electronics and Kitchen adapted to the three other product types, together with the average.)

35 A PAC-Bayesian Approach for Domain Adaptation - Conclusion
- The first PAC-Bayesian analysis for domain adaptation, expressed as a ρ-average over a class of hypotheses
- A divergence depending on ρ, which has the advantage of being directly optimizable (with theoretical justification)
- A first algorithm specialized to linear classifiers, with promising results
- Opens the door to tackling DA tasks with all the PAC-Bayesian tools

36 Outline
1 Domain Adaptation - Introduction
2 What the theory says
3 Unsupervised Visual Domain Adaptation Using Subspace Alignment - ICCV 2013 (joint work with B. Fernando and T. Tuytelaars (K.U. Leuven), and M. Sebban)
4 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - ICML 2013 (joint work with P. Germain and F. Laviolette (U. Laval, Canada), and E. Morvant (LIF, Marseille, now post-doc at IST Austria))
5 Conclusion

37 Conclusion and Perspectives
- Domain adaptation / transfer learning is a hot topic. Many application domains need such technology: image classification, computer vision, multimedia indexing, speech recognition, natural language processing, ...
- The field moves fast; many methods exist.
- The first theories were good for understanding how adaptation can work, but some settings are still not fully understood (importance of the distance).
- Changes of representation space: lots of possible directions.
- Approaches taking into account multi-source and multi-task settings are becoming more and more popular.
- Classifier combination approaches, combined with new representation-learning methods (w.r.t. an appropriate distance), are promising.
- Controlling negative transfer is an important issue.

39 A word about parameter tuning: an open problem
Problem: no labels on the target domain.
Solution: a kind of reverse validation [Zhong et al., ECML 10], using the reverse classifier $h_l^r$:
1. Learn $h_l$ from LS ∪ TS (labeled source sample LS and unlabeled target sample TS)
2. Auto-label TS with $h_l$
3. Learn the reverse classifier $h_l^r$ from the auto-labeled TS
4. Evaluate $h_l^r$ on LS by cross-validation
If the two domains are related, $h_l^r$ performs well on the source domain [Bruzzone et al., PAMI 10].
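A minimal sketch of this reverse-validation idea for picking a hyper-parameter (mine, assuming scikit-learn and a hypothetical factory make_da_model(param) whose fit takes labeled source and unlabeled target data; it also simplifies step 4 by scoring on the full labeled source sample instead of cross-validating over source splits):

import numpy as np
from sklearn.linear_model import LogisticRegression

def reverse_validation_score(make_da_model, param, Xs, ys, Xt):
    # 1-2. Train the DA model with the candidate parameter and auto-label the target sample.
    model = make_da_model(param).fit(Xs, ys, Xt)   # hypothetical DA estimator API
    yt_auto = model.predict(Xt)
    # 3. Learn the reverse classifier from the auto-labeled target data only.
    reverse_clf = LogisticRegression(max_iter=1000).fit(Xt, yt_auto)
    # 4. Evaluate the reverse classifier on the labeled source sample
    #    (the original procedure does this by cross-validation over LS).
    return reverse_clf.score(Xs, ys)

def select_param(make_da_model, params, Xs, ys, Xt):
    scores = [reverse_validation_score(make_da_model, p, Xs, ys, Xt) for p in params]
    return params[int(np.argmax(scores))]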

40 A PAC-Bayesian Approach for Domain Adaptation (PBDA) - Domain Adaptation Bound for the Gibbs Classifier, from which the algorithm is designed
Theorem (PAC-Bayesian generalization bound, Catoni's style). For any domains $P_S$ and $P_T$ (resp. with marginals $D_S$ and $D_T$) over $X \times Y$, any set of hypotheses $H$, any prior distribution π over $H$, any $\delta \in (0, 1]$ and any real numbers $\alpha > 0$ and $c > 0$, with probability at least $1-\delta$ over the choice of $S \times T \sim (P_S \times D_T)^m$, we have, for all ρ on $H$,
$R_{P_T}(G_\rho) \le \nu_\rho + c'\, R_S(G_\rho) + \alpha'\, \mathrm{dis}_\rho(S, T) + \left( \frac{c'}{c} + \frac{2 \alpha'}{\alpha} \right) \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{3}{\delta}}{m}$
where $\nu_\rho = R_{P_T}(G_{\rho_T^*}) + R_{D_T}(G_\rho, G_{\rho_T^*}) + R_{D_S}(G_\rho, G_{\rho_T^*})$, $c' = \frac{c}{1 - e^{-c}}$, and $\alpha' = \frac{2\alpha}{1 - e^{-2\alpha}}$.
Similarly to PBGD, we have specialized it to a set of linear classifiers (PBDA).
