Instance-based Domain Adaptation

Size: px

Start display at page:

Download "Instance-based Domain Adaptation"

Francine Cook
5 years ago
Views:

1 Instance-based Domain Adaptation Rui Xia School of Computer Science and Engineering Nanjing University of Science and Technology 1

2 Problem Background Training data Test data Movie Domain Sentiment Classifier 85% 65% 75% 2

3 How to find the right training data? Picture from IJCAI-13 tutorial: Transfer Learning with Applications 3

4 A Summary of Domain Adaptation Methods in NLP Methods Description References Semi-supervised based Parameter based Feature based Instance based Build a semi-supervised classifier based on the source domain labeled data, and target domain unlabeled data Assume that models of the source and target domain share the same prior parameter Learn a new feature representation (or a new labeling function) for the target domain Learn the importance of labeled data in the source domain by instance selection and instance weighting [Aue, 2005; Tan, 2007; Dai, 2007; Tan, 2009] [Chelba, 2006; Li, 2009; Xue, 2008; Finkel, 2009] [Daume III, 2007; Blitzer, 2007; Gao, 2008; Pan, 2009; Pan, 2010b; Ji, 2011; Xia, 2011; Samdani, 2011; Glorot, 2011; Duan, 2012] [Sugiyama, 2007; Bickel, 2009; Axelrod, 2011] 4

5 Our Work PUIS/PUIW (Instance Selection and Instance Weighting via PU learning) Rui Xia, Xuelei Hu, Jianfeng Lu, Jian Yang, and ChengqingZong. Instance Selection and Instance Weighting for Cross-domain Sentiment Classification via PU Learning. IJCAI ILA (Instance Adaptation via In-target-domain Logistic Approximation) Rui Xia, Jianfei Yu, Feng Xu, and Shumei Wang. Instance-based Domain Adaptation in NLP via In-target-domain Logistic Approximation. AAAI

6 Part 1. PUIS/PUIW (Instance Selection and Instance Weighting via PU Learning) 6

7 Introduction to PU learning What is PU learning? Give a set of positive (P) samples of a particular class, and a set of unlabeled (U) samples that contains hidden positive and negative ones Build a binary-class classifier using P and U Relation to traditional semi-supervised learning Traditional semi-supervised learning : Learning with a small set of labeled data and a large set of unlabeled data (LU learning) PU learning: labeled data only contain positive samples 7

8 Illustration of PU Learning P P U Nr (reliable negative) Ur (remaining unlabeled) Apply traditional semi-supervised learning, based on P, Nr and Ur positive hidden positive hidden negative 8

9 A Motivation Example of PUIS/PUIW Large-size source domain training data are used as U set (blue cross) Some target domain unlabeled data are used as P set (red dot) PUL firstly identifies a reliable negative set Nr from U, and secondly builds a semi-supervised classifier using P, Nr, and U-Nr 9

10 PU Learning for Instance Selection (PUIS) Step 1: Detect reliable negative samples by sampling some spy samples and building a supervised NB classifier Step 2: Build a semisupervised NB based on EM training (NBEM) Step 3: Use the NBEM classifier as an in-targetdomain selector 10

11 The Instance Weighting Framework Expected log-likelihood of the target domain L(µ) = ¼ Instance Weighting: Density Ratio Estimation Z Z = x2x x2x Z x2x X y2y p t (x; y) log p(x; yjµ)dx p t (x) p s (x) p s(x) X y2y r(x)p s (x) X y2y p s (yjx) log p(x; yjµ)dx p s (yjx) log p(x; yjµ)dx Instance-weighted maximum likelihood estimation µ = arg max µ 1 N s XN s n=1 r(x n )log p(x n ; y n jµ) 11

12 PU Learning for Instance Weighting (PUIW) In-target-domain Sampling Assumption q t (x) = r(x)p s (x) / p(d = 1jx)p s (x) Instance Weighting via PU learning and Probability Calibration r(x) = p s(x) p t (x) / p(d = 1jx) = exp f(x) The calibration parameter The log-likelihood output of PU learning 12

13 Instance-weighted Classification Model Learning in traditional Naïve Bayes p(c j ) = P k I(y k = c j ) P j 0 I(y k = c 0 j ) = N j N P k p(t i jc j ) = P k I(y k = c j )N(t i ; x k ) P k I(y k = c j ) P i 0 =1 N(t i 0;x k) Learning in instance-weighted Naïve Byes p(c j ) = P k I(y k = c j )r(x k ) P k r(x k) p(t i jc j ) = P k I(y k = c j )r(x k )N(t i ;x k ) Pk I(y k = c j )r(x k ) P V i 0 =1 N(t i 0; x k) Note: PUIW can also be applied to Discriminative model (e.g., weighted MaxEnt, weighted SVMs) 13

14 Experiments on Cross-domain Sentiment Classification Setting 1 (with a normal size of training set) Source domain: Movie (training set size: 2,000) Target domains: {Book, DVD, Electronics, Kitchen} Setting 2 (with a large size of training set) Source domain: Video (training set size: 10,000) Target domains: 12 domains of reviews extracted from the Multi-domain dataset 14

15 Classification Normal-size Training Set Task KLD ALL PUIS PUIW Movie Book Movie DVD Movie Electronics Movie Kitchen Average

16 Accuracy Normal-size Training Set Movie Book 10-5 Alpha Movie DVD 10-5 Alpha Accuracy PUIW PUIS Random Selected sample size Accuracy 0.75 PUIW 0.7 PUIS Random Selected sample size Accuracy Movie Eletronics 10-4 Alpha PUIW PUIS Random Accuracy Movie Kitchen 10-4 Alpha PUIW PUIS Random Selected sample size Selected sample size For Random and PUIS, the bottom x-axis (number of selected samples) is used; For PUIW, the top x-axis (calibration parameter alpha) is used. 16

17 Large-size Training Set Task KLD ALL PUIS PUIW Video Apparel Video Baby Video Books Video Camera Video DVD Video Electronics Video Health Video Kitchen Video Magazines Video Music Video Software Video Toys Average

18 Accuracy A Large-size Training Set Accuracy Accuracy Accuracy Video Apparel 10-4 Alpha PUIW PUIS Random Selected sample size Video Baby 10-4 Alpha PUIS PUIW Random Selected sample size Video Books 10-4 Alpha PUIW PUIS Random Selected sample size Accuracy Accuracy Accuracy Video Camera 10-4 Alpha PUIW PUIS Random Selected sample size Video Health Alpha PUIW PUIS Random Selected sample size Video Kitchen 10-4 Alpha PUIS PUIW Random Selected sample size For Random and PUIS, the bottom x-axis is used; For PUIW, the top x-axis is used. 18

19 Conclusions from the Results The effect of adding more training samples When the size of training data is small (<2000) When the size of training data becomes larger (>3000) The necessity of PUIS/PUIW The improvements of both PUIS/PUIW are significant PUIW is overall better than PUIS The stability of PUIS/PUIW The number of selected samples in PUIS is hard to determine The curve of PUIW is unimodal(best \alpha 0.1) 19

20 The Relation of K-L Distance and Accuracy Improvement Acc improvement (%) Experimental Setting KL divergence Experimental Setting KL divergence KLD and Accuracy Improvement are in a roughly linear relation. 20

21 Part 2. Instance Adaptation via In-target-domain Logistic Approximation (ILA) 21

22 In-target-domain Sampling Model PUIW (PU learning for Instance Weighting) q t (x) = r(x)p s (x) 1 / 1 + exp f(x) p s(x) ILA (In-target-domain logistic approximation) q t (x) / p(d = 1jx)p s (x) 1 = 1 + e p!t x s(x) Shortcomings in PUIW: PU the Learning log-likelihood and Probability output Calibration of PU learning are conducted separately In ILA: We will learn the instance weight in one single model 22

23 Illustration of instance-weighted sampling Sampling weights: [1, 1, 1] Sampling weights: [0.5, 1.5, 1] Sampling weights: [1.5, 0.5, 1] f1 f2 f3 f4 f5 f6 0 f1 f2 f3 f4 f5 f6 0 f1 f2 f3 f4 f5 f6 23

24 Illustration of parameter learning criterion Target domain Approximated distribution f1 f2 f3 f4 f5 f Target domain true distribution - = f1 f2 f3 f4 f5 f Distributional Distance f1 f2 f3 f4 f5 f Another Target domain Approximated distribution f1 f2 f3 f4 f5 f Target domain true distribution - = f1 f2 f3 f4 f5 f Distributional Distance f1 f2 f3 f4 f5 f6 24

25 Parameter Learning in ILA K-L Distance (KLD) of the True and Approximated Target-domain Probability KL(p t (x)jjq t (x)) = min!; Z s:t: KL = max!; x2x Z = Z Z x2x q t (x)dx = x2x x2x p t (x) log p t(x) q t (x) dx p t (x) log p t (x) log Z x2d t p t (x) 1+e!T x p s(x) dx Optimization with normalization constraint 1 + e!t x dx 1 + e!t p s (x)dx = 1 25

26 Empirical Form of KLD Minimization Empirical K-L distance minimization max w;b;c 1 N t s:t: 1 XN s N s j=1 XN t i=1 log 1 + e wt x i 1 + e wt x j = 1 () = P Ns j=1 N s 1 1+e wt x j Final optimization formula (loss function) w = arg max w = arg min w 1 N t XN t i=1 XN t i=1 log log XN s j=1 P N s j=1 N s 1 1+e wt x j 1 + e wt x i 1 1+e wt x j 1 1+e wt x i A standard unconstraint optimization problem! 26

27 Gradient-based Optimization Gradient of the k = 1 P Ns j=1 s(!t x j ) XN s j=1 s(! T x j )(1 s(! T x j ))x j;k 1 XN t (1 s(! T x i ))x i;k N t i=1 where s(! T x j ) = 1 1+e wt x j Optimization method Gradient Descent! (t+1) k =! (t) k Newton, Quasi-Newton, L-BFGS can also work 27

28 Instance-weighted Classification In-target-domain sampling weights for each sourcedomain labeled data r(x n ) = q t(x n ) p s (x n ) = 1 + e!t x n where = P Ns j=1 N s 1 1+e wt x j Instance-weighted classification model Generative: weighted naïve Bayes Discriminative: weighted logistic regression, weighted SVMs 28

29 Instance Adaptation Feature Selection Domain-sensitive Information Gain IG(t k ) = Feature Space X l2f0;1g + p(t k ) X p(d = l)log p(d = l) l2f0;1g + p(¹t k ) X l2f0;1g p(d = ljt k )log p(d = ljt k ) p(d = lj¹t k )log p(d = lj¹t k ) ILA: from original feature space to dimension-reduced feature space x! ^x KLIEP: from original feature space to kernel space x! Á(x) 29

30 Experimental Settings Setting 1: Cross-domain Document Categorization 20 Newsgroups dataset Top categories are used as class labels, and subcategories are used to generate source and target domains [Dai, 2007] Setting 2: Cross-domain Sentiment Classification Source domain: Movie Target domains: {Book, DVD, Electronics, Kitchen} 30

31 Experimental Results Cross-domain Document Categorization Dataset K-L divergence No Adaptation Linear KLIEP Gaussian PUIS PUIW ILA sci vs com talk vs com sci vs talk rec vs sci rec vs com talk vs rec Avg A vs B means that the top category A and B are used as class labels, and subcategories under the top categories are used to generate the source and target domain datasets. 31

32 Experimental Results Cross-domain Sentiment Classification Dataset K-L divergence No Adaptation Linear KLIEP Gaussian PUIS PUIW ILA movie book movie dvd movie elec movie kitchen Average A B denote that we use dataset A as the source domain, and B as the target domain. 32

33 Parameter Sensitivity Effect of Instance Adaptation Feature Selection Accuracy movie dvd movie elec talk vs rec rec vs sci The percentage of selected features ILA can work efficiently in a dimensionality-reduced linear feature space 33

34 Comparison of Parameter Stability Accuracy PUIW 0.95 movie dvd 0.9 movie elec talk vs rec 0.85 rec vs sci calibration parameter α KLIEP movie dvd movie elec talk vs rec rec vs sci kernel parameter σ Parameter stability of PUIW and KLIEP-Gaussian. The x-axis denotes the value of the calibration parameter \alpha in PUIW(left), and the kernel parameter \delta in KLIEP-Gaussian (right). ILA is more convenient at parameter tuning in comparison with PUIW and KLIEP 34

35 Comparison of Computational Efficiency Computational Time Cost Task KLIEP-Gaussian #1000 #100 PUIW ILA Text categorization 19346s 2121s 241s 161s Sentiment classification 5219s 482s 184s 97s Overall Summary Joint model rather than separate model Parameters are learnt rather then tuned Better performance in Classification Accuracy, Parameter Stability, and Computational Efficiency 35

36 Feasibility to apply to the other NLP tasks Classification Task Cross-domain Text / Sentiment Classification Cross-domain NER? Cross-domain WSD? Sequential Learning Task Cross-domain Chinese word segmentation? Cross-domain parsing? Bilingual Alignment Learning Task Cross-domain MT? 36

37 Any Questions? 37

Instance-based Domain Adaptation via Multi-clustering Logistic Approximation

Instance-based Domain Adaptation via Multi-clustering Logistic Approximation FENG U, Nanjing University of Science and Technology JIANFEI YU, Singapore Management University RUI IA, Nanjing University