Advanced Machine Learning


1 Advanced Machine Learning Domain Adaptation MEHRYAR MOHRI COURANT INSTITUTE & GOOGLE RESEARCH.

2 Non-Ideal World [Figure: the ideal learning scenario vs. the real world, which differ in domain, in sampling, and over time.] page 2

3 Outline Domain adaptation. Multiple-source domain adaptation. page 3

4 Domain Adaptation Sentiment analysis. Language modeling, part-of-speech tagging. Statistical parsing. Speech recognition. Computer vision. Solution critical for applications. page 4

5 This Talk Domain adaptation Discrepancy Theoretical guarantees Algorithm Enhancements page 5

6 Domain Adaptation Problem Domains: source $(Q, f_Q)$, target $(P, f_P)$. Input: labeled sample $S$ drawn from the source, unlabeled sample $T$ drawn from the target. Problem: find a hypothesis $h$ in $H$ with small expected loss with respect to the target domain, that is $L_P(h, f_P) = \mathbb{E}_{x \sim P}\big[L\big(h(x), f_P(x)\big)\big]$. page 6

7 Sample Bias Correction Problem: special case of domain adaptation with $f_Q = f_P$ and $\mathrm{supp}(Q) \subseteq \mathrm{supp}(P)$. page 7

8 Related Work in Theory Single-source adaptation: relation between adaptation and the $d_A$ distance (Devroye et al. (1996); Kifer et al. (2004); Ben-David et al. (2007)). a few negative examples of adaptation (Ben-David et al. (AISTATS 2010)). analysis and learning guarantees for importance weighting (Cortes, Mansour, and MM (NIPS 2010)). page 8

9 Related Work in Theory Multiple-source: same input distribution, but different labels (Crammer et al., 2005, 2006). theoretical analysis and method for multiple-source adaptation (Mansour, MM, Rostamizadeh, 2008). page 9

10 Distribution Mismatch [Figure: source distribution $Q$ vs. target distribution $P$.] Which distance should we use to compare these distributions? page 10

11 Simple Analysis Proposition: assume that the loss $L$ is bounded by $M$, then $|L_Q(h, f) - L_P(h, f)| \le M\, L_1(Q, P)$. Proof: $|L_P(h, f) - L_Q(h, f)| = \big|\mathbb{E}_{x \sim P}[L(h(x), f(x))] - \mathbb{E}_{x \sim Q}[L(h(x), f(x))]\big| = \big|\sum_x L(h(x), f(x))\,(P(x) - Q(x))\big| \le M \sum_x |P(x) - Q(x)|$. But, is this bound informative? page 11
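As a quick sanity check of this proposition (an illustration, not part of the original slides), the following Python sketch compares the two sides of the bound on a small discrete domain with the zero-one loss, so that M = 1.

import numpy as np

rng = np.random.default_rng(0)
n = 10                                  # size of the discrete domain
P = rng.dirichlet(np.ones(n))           # target distribution
Q = rng.dirichlet(np.ones(n))           # source distribution
f = rng.integers(0, 2, size=n)          # target labels
h = rng.integers(0, 2, size=n)          # an arbitrary hypothesis

loss = (h != f).astype(float)           # zero-one loss, bounded by M = 1
L_P = P @ loss                          # L_P(h, f)
L_Q = Q @ loss                          # L_Q(h, f)
l1 = np.abs(P - Q).sum()                # L1(Q, P)
assert abs(L_Q - L_P) <= l1 + 1e-12
print(f"|L_Q - L_P| = {abs(L_Q - L_P):.4f} <= M * L1(Q, P) = {l1:.4f}")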

12 Example - Zero-One Loss [Figure: hypotheses $h$ and $f$ differ only on a small region $a$ of the input space.] $|L_Q(h, f) - L_P(h, f)| = |Q(a) - P(a)|$, which can be small even when the $L_1$ distance between $Q$ and $P$ is large. page 12

13 Discrepancy Definition: (Mansour, MM, Rostami, COLT 2009) $\mathrm{disc}(P, Q) = \max_{h, h' \in H} \big|L_P(h, h') - L_Q(h, h')\big|$. symmetric, verifies the triangle inequality, in general not a distance. helps compare distributions for arbitrary losses, e.g. hinge loss or $L_p$ loss. generalization of the $d_A$ distance (Devroye et al. (1996); Kifer et al. (2004); Ben-David et al. (2007)). page 13
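For the zero-one loss and threshold classifiers on the real line, the empirical discrepancy can be computed by brute force, since the loss between two thresholds is just the sample mass of the interval separating them. The sketch below is illustrative; the function name empirical_disc_thresholds is ours, not from the slides.

import numpy as np

def empirical_disc_thresholds(source, target):
    # disc(P_hat, Q_hat) for the zero-one loss and threshold hypotheses
    # h_t(x) = 1[x >= t]: two thresholds t < t2 disagree exactly on [t, t2),
    # so the discrepancy is the largest gap between the empirical masses
    # that the two samples put on such an interval.
    thresholds = np.concatenate(([-np.inf], np.sort(np.concatenate([source, target])), [np.inf]))
    best = 0.0
    for i, t in enumerate(thresholds):
        for t2 in thresholds[i + 1:]:
            q_mass = np.mean((source >= t) & (source < t2))
            p_mass = np.mean((target >= t) & (target < t2))
            best = max(best, abs(p_mass - q_mass))
    return best

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=100)   # sample from Q
target = rng.normal(0.5, 1.0, size=100)   # sample from P
print("empirical discrepancy:", empirical_disc_thresholds(source, target))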

14 Discrepancy - Properties Theorem: for the $L_q$ loss bounded by $M > 0$, for any $\delta > 0$, with probability at least $1 - \delta$, $\mathrm{disc}(P, Q) \le \mathrm{disc}(\widehat{P}, \widehat{Q}) + 4q\big(R_S(H) + R_T(H)\big) + 3M\Big(\sqrt{\tfrac{\log\frac{4}{\delta}}{2m}} + \sqrt{\tfrac{\log\frac{4}{\delta}}{2n}}\Big)$. page 14

15 Discrepancy = Distance Theorem: let $K$ be a universal kernel (e.g., Gaussian kernel) and $H = \{h \in H_K : \|h\|_K \le \Lambda\}$. Then, for the $L_2$ loss, the discrepancy is a distance over a compact set $X$. (Cortes & MM (TCS 2013)) Proof: $\Phi \colon h \mapsto \mathbb{E}_{x \sim P}[h^2(x)] - \mathbb{E}_{x \sim Q}[h^2(x)]$ is continuous for the sup norm, thus continuous on $C(X)$. $\mathrm{disc}(P, Q) = 0$ implies $\Phi(h) = 0$ for all $h \in H$. since $H_K$ is dense in $C(X)$, $\Phi = 0$ over $C(X)$. thus, $\mathbb{E}_P[f] - \mathbb{E}_Q[f] = 0$ for all $f \ge 0$ in $C(X)$. this implies $P = Q$. page 15

16 Theoretical Guarantees Two types of questions: difference between the average loss of a hypothesis $h$ on $Q$ versus on $P$? difference of loss (measured on $P$) between the hypothesis $h$ obtained when training on $(Q, f_Q)$ versus the hypothesis $h_P$ obtained when training on $(P, f_P)$? page 16

17 Generalization Bound (Mansour, MM, Rostami (COLT 2009)) Notation: $L_Q(h_Q^*, f_Q) = \min_{h \in H} L_Q(h, f_Q)$, $L_P(h_P^*, f_P) = \min_{h \in H} L_P(h, f_P)$. Theorem: assume that the loss $L$ obeys the triangle inequality, then for any $h \in H$, $L_P(h, f_P) \le L_Q(h, h_Q^*) + L_P(h_P^*, f_P) + \mathrm{disc}(P, Q) + L_Q(h_Q^*, h_P^*)$. page 17

18 Some Natural Cases When $h^* = h_Q^* = h_P^*$, $L_P(h, f_P) \le L_Q(h, h^*) + L_P(h^*, f_P) + \mathrm{disc}(P, Q)$. When $f_P \in H$ (consistent case), $|L_P(h, f_P) - L_Q(h, f_P)| \le \mathrm{disc}(Q, P)$. Bound of (Ben-David et al., NIPS 2006) or (Blitzer et al., NIPS 2007): always worse in these cases. page 18

19 Regularized ERM Algorithms Objective function: $F_{\widehat{Q}}(h) = \lambda \|h\|_K^2 + \widehat{R}_{\widehat{Q}}(h)$, where $K$ is a PDS kernel, $\lambda > 0$ is a trade-off parameter, and $\widehat{R}_{\widehat{Q}}(h)$ is the empirical error of $h$. broad family of algorithms including SVM, SVR, kernel ridge regression, etc. page 19
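A minimal instance of this family, sketched below for illustration (not from the slides), is kernel ridge regression: with the squared loss, the objective $\lambda\|h\|_K^2 + \frac{1}{m}\sum_i (h(x_i) - y_i)^2$ has the closed-form minimizer $h = \sum_j \alpha_j K(x_j, \cdot)$ with $\alpha = (K + \lambda m I)^{-1} y$.

import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    # minimize lam * ||h||_K^2 + (1/m) * sum_i (h(x_i) - y_i)^2 over the RKHS
    m = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def krr_predict(alpha, X_train, X_test, gamma=1.0):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = krr_fit(X, y)
print(krr_predict(alpha, X, X[:5]))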

20 Guarantees for Reg. ERM (Cortes & MM (TCS 2013)) Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and $L$ a loss function such that $L(\cdot, y)$ is $\mu$-Lipschitz. Let $h'$ be the minimizer of $F_{\widehat{P}}$ and $h$ that of $F_{\widehat{Q}}$, then, for all $(x, y) \in X \times Y$, $\big|L(h'(x), y) - L(h(x), y)\big| \le \mu R \sqrt{\frac{\mathrm{disc}(\widehat{P}, \widehat{Q}) + \mu\, \eta_H(f_P, f_Q)}{\lambda}}$, where $\eta_H(f_P, f_Q) = \inf_{h \in H} \Big[\max_{x \in \mathrm{supp}(\widehat{P})} |f_P(x) - h(x)| + \max_{x \in \mathrm{supp}(\widehat{Q})} |f_Q(x) - h(x)|\Big]$. page 20

21 Proof By the property of the minimizers, there exist subgradients such that $2\lambda h' = -\partial \widehat{R}_{\widehat{P}}(h')$ and $2\lambda h = -\partial \widehat{R}_{\widehat{Q}}(h)$. Thus, $2\lambda \|h' - h\|^2 = \big\langle h' - h,\ \partial \widehat{R}_{\widehat{Q}}(h) - \partial \widehat{R}_{\widehat{P}}(h')\big\rangle = -\big\langle h' - h, \partial \widehat{R}_{\widehat{P}}(h')\big\rangle + \big\langle h' - h, \partial \widehat{R}_{\widehat{Q}}(h)\big\rangle \le \widehat{R}_{\widehat{P}}(h) - \widehat{R}_{\widehat{P}}(h') + \widehat{R}_{\widehat{Q}}(h') - \widehat{R}_{\widehat{Q}}(h) \le 2\,\mathrm{disc}(\widehat{P}, \widehat{Q}) + 2\mu\, \eta_H(f_P, f_Q)$. page 21

22 Guarantees for Reg. ERM (Cortes & MM (TCS 2013)) Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and the $L_2$ loss bounded by $M$. Then, for all $(x, y)$, $\big|L(h'(x), y) - L(h(x), y)\big| \le \frac{R\sqrt{M}}{\lambda}\big(\mathrm{disc}(\widehat{P}, \widehat{Q}) + \delta\big)$, where $\delta = \min_{h \in H} \Big\| \mathbb{E}_{x \sim \widehat{Q}}\big[(h(x) - f_Q(x))\, \Phi_K(x)\big] - \mathbb{E}_{x \sim \widehat{P}}\big[(h(x) - f_P(x))\, \Phi_K(x)\big] \Big\|_K$. For $f_P = f_Q = f$: $\delta = 0$ if $f \in H$; $\delta$ small if $f$ is $\epsilon$-close to $H$ on the samples, e.g., for a Gaussian kernel and $f$ continuous. page 22

23 Empirical Discrepancy Discrepancy measure $\mathrm{disc}(\widehat{P}, \widehat{Q})$: critical term in the bounds. Smaller empirical discrepancy guarantees closeness of the pointwise losses of $h'$ and $h$. But, can we further reduce the discrepancy? page 23

24 Algorithm - Idea Search for a new empirical distribution $q^*$ with the same support: $q^* = \mathrm{argmin}_{\mathrm{supp}(q) \subseteq \mathrm{supp}(\widehat{Q})} \mathrm{disc}(\widehat{P}, q)$. Solve the modified optimization problem: $\min_h F_{q^*}(h) = \sum_{i=1}^m q^*(x_i)\, L(h(x_i), y_i) + \lambda \|h\|_K^2$. page 24
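For the squared loss and an RKHS hypothesis set, the reweighted problem above is a weighted kernel ridge regression and again has a closed form; the sketch below is one possible implementation under those assumptions (the name weighted_krr_fit and the stand-in weights are ours, not from the slides).

import numpy as np

def weighted_krr_fit(K, y, q, lam=0.1):
    # minimize sum_i q_i (h(x_i) - y_i)^2 + lam * ||h||_K^2 over the RKHS;
    # with h = sum_j alpha_j K(x_j, .), a minimizer solves
    # (diag(q) K + lam I) alpha = diag(q) y.
    W = np.diag(q)
    return np.linalg.solve(W @ K + lam * np.eye(K.shape[0]), W @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = X @ X.T                          # linear kernel matrix on the source sample
y = X[:, 0] + 0.1 * rng.normal(size=20)
q = np.full(20, 1 / 20)              # stand-in for the weights q* from the DM step
alpha = weighted_krr_fit(K, y, q)
print(alpha[:5])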

25 Case of Halfspaces page 25

26 Min-Max Problem Reformulation: $\widehat{Q}^* = \mathrm{argmin}_{\widehat{q} \in \mathcal{Q}} \max_{h, h' \in H} \big|L_{\widehat{P}}(h, h') - L_{\widehat{q}}(h, h')\big|$. game-theoretic interpretation. gives a lower bound: $\max_{h, h' \in H} \min_{\widehat{q} \in \mathcal{Q}} \big|L_{\widehat{P}}(h, h') - L_{\widehat{q}}(h, h')\big| \le \min_{\widehat{q} \in \mathcal{Q}} \max_{h, h' \in H} \big|L_{\widehat{P}}(h, h') - L_{\widehat{q}}(h, h')\big|$. page 26

27 Classification - 0/1 Loss Problem: $\min_{Q'} \max_{a \in H \Delta H} \big|Q'(a) - \widehat{P}(a)\big|$ subject to: $\forall x \in S_Q,\ Q'(x) \ge 0$; $\sum_{x \in S_Q} Q'(x) = 1$. page 27

28 Classification - 0/1 Loss Linear program (LP): $\min_{Q', \delta}\ \delta$ subject to: $\forall a \in H \Delta H,\ Q'(a) - \widehat{P}(a) \le \delta$; $\forall a \in H \Delta H,\ \widehat{P}(a) - Q'(a) \le \delta$; $\forall x \in S_Q,\ Q'(x) \ge 0$; $\sum_{x \in S_Q} Q'(x) = 1$. Number of constraints bounded by the shattering coefficient $\Pi_{H \Delta H}(m_0 + n_0)$. page 28
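Assuming the regions $a \in H \Delta H$ restricted to the samples have already been enumerated (which the slide bounds by the shattering coefficient), the LP itself is small; here is an illustrative sketch with scipy, where min_disc_lp and the toy regions are our own names, not from the slides.

import numpy as np
from scipy.optimize import linprog

def min_disc_lp(regions, p_hat_masses, m):
    # variables: q_1, ..., q_m (weights on the source points) and the slack delta;
    # minimize delta subject to |q(a) - P_hat(a)| <= delta for every region a,
    # q >= 0 and sum_i q_i = 1.
    n_reg = len(regions)
    c = np.zeros(m + 1)
    c[-1] = 1.0
    A_ub = np.zeros((2 * n_reg, m + 1))
    b_ub = np.zeros(2 * n_reg)
    for j, (idx, p_a) in enumerate(zip(regions, p_hat_masses)):
        A_ub[2 * j, list(idx)] = 1.0       #  q(a) - delta <= P_hat(a)
        A_ub[2 * j, -1] = -1.0
        b_ub[2 * j] = p_a
        A_ub[2 * j + 1, list(idx)] = -1.0  # -q(a) - delta <= -P_hat(a)
        A_ub[2 * j + 1, -1] = -1.0
        b_ub[2 * j + 1] = -p_a
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                      # sum_i q_i = 1 (delta not included)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (m + 1))
    return res.x[:m], res.x[-1]

# toy usage: 3 source points; two regions of H Delta H with target masses 0.5 and 0.2
q, delta = min_disc_lp([[0, 1], [2]], [0.5, 0.2], m=3)
print("q =", q, "min discrepancy =", delta)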

29 Algorithm - 1D page 29

30 Regression - L2 Loss Problem: $\min_{\widehat{q}' \in \mathcal{Q}} \max_{h, h' \in H} \big|\mathbb{E}_{\widehat{P}}[(h'(x) - h(x))^2] - \mathbb{E}_{\widehat{q}'}[(h'(x) - h(x))^2]\big|$. For linear hypotheses, this equals $\min_{\widehat{q}' \in \mathcal{Q}} \max_{\|w\| \le 1, \|w'\| \le 1} \big|\mathbb{E}_{\widehat{P}}[((w' - w)^\top x)^2] - \mathbb{E}_{\widehat{q}'}[((w' - w)^\top x)^2]\big| = \min_{\widehat{q}' \in \mathcal{Q}} \max_{\|w\| \le 1, \|w'\| \le 1} \Big|\sum_{x \in S} (\widehat{P}(x) - \widehat{q}'(x))\,[(w' - w)^\top x]^2\Big| = \min_{\widehat{q}' \in \mathcal{Q}} \max_{\|u\| \le 2} \Big|\sum_{x \in S} (\widehat{P}(x) - \widehat{q}'(x))\,[u^\top x]^2\Big| = \min_{\widehat{q}' \in \mathcal{Q}} \max_{\|u\| \le 2} \Big|u^\top \Big(\sum_{x \in S} (\widehat{P}(x) - \widehat{q}'(x))\, x x^\top\Big) u\Big|$. page 30

31 Regression - L2 Loss Problem equivalent to $\min_{\|z\|_1 = 1,\, z \ge 0} \max_{\|u\| = 1} \big|u^\top M(z)\, u\big|$, with: $M(z) = M_0 - \sum_{i=1}^{m_0} z_i M_i$, $M_0 = \sum_{x \in S} \widehat{P}(x)\, x x^\top$, $M_i = s_i s_i^\top$, where $s_1, \ldots, s_{m_0}$ are the elements of $\mathrm{supp}(\widehat{Q})$. page 31

32 Regression - L2 Loss Semi-definite program (SDP), linear hypotheses: $\min_{z, \delta}\ \delta$ subject to: $\delta I - M(z) \succeq 0$; $\delta I + M(z) \succeq 0$; $\mathbf{1}^\top z = 1$; $z \ge 0$, where the matrix $M(z)$ is defined by $M(z) = \sum_{x \in S} \widehat{P}(x)\, x x^\top - \sum_{i=1}^{m_0} z_i s_i s_i^\top$. page 32
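The problem can be stated almost verbatim with a modeling tool such as cvxpy, as in the sketch below (our own illustration, assuming cvxpy with an SDP-capable solver such as SCS is installed); minimizing the spectral norm of $M(z)$ is the same SDP, and the returned value is the minimized $\max_{\|u\|=1} |u^\top M(z) u|$.

import cvxpy as cp
import numpy as np

def min_disc_sdp(X_target, p_hat, S_support):
    # X_target: target sample points (rows); p_hat: their empirical weights;
    # S_support: support points s_i of Q_hat (rows).  Returns the weights z
    # minimizing the spectral norm of M(z) and the minimal value.
    M0 = (X_target * p_hat[:, None]).T @ X_target   # sum_x P_hat(x) x x^T
    m0 = S_support.shape[0]
    z = cp.Variable(m0, nonneg=True)
    M = M0 - sum(z[i] * np.outer(S_support[i], S_support[i]) for i in range(m0))
    prob = cp.Problem(cp.Minimize(cp.sigma_max(M)), [cp.sum(z) == 1])
    prob.solve()
    return z.value, prob.value

rng = np.random.default_rng(0)
X_t = rng.normal(0.5, 1.0, size=(30, 2))            # target sample
S_q = rng.normal(0.0, 1.0, size=(20, 2))            # support of Q_hat
z, val = min_disc_sdp(X_t, np.full(30, 1 / 30), S_q)
print("minimized value:", val)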

33 Regression - L2 Loss SDP: generalization to an RKHS $H$ for some kernel $K$: $\min_{z, \delta}\ \delta$ subject to: $\delta I - M(z) \succeq 0$; $\delta I + M(z) \succeq 0$; $\mathbf{1}^\top z = 1$; $z \ge 0$, with: $M(z) = M_0 - \sum_{i=1}^{m_0} z_i M_i$, $M_0 = K^{1/2} \mathrm{diag}\big(\widehat{P}(s_1), \ldots, \widehat{P}(s_{p_0})\big) K^{1/2}$, $M_i = K^{1/2} I_i K^{1/2}$. page 33

34 Discrepancy Min. Algorithm Convex optimization: cast as a semi-definite programming (SDP) problem. efficient solution using smooth optimization. Algorithm and solution for arbitrary kernels. Outperforms other algorithms in experiments. (Cortes & MM (TCS 2013)) page 34

35 Experiments Classification: Q and P Gaussians. H: halfspaces. f: interval [-1, +1]. [Plot: error vs. number of training points, comparing training with the minimized discrepancy ("w/ min disc") and with the original discrepancy ("w/ orig disc").] page 35

36 Experiments Regression. [Plot: error vs. number of training points for the source Q and the target P.] SDP solved in about 15s using SeDuMi on a 3GHz CPU with 2GB of memory. page 36

37 Experiments page 37

38 Enhancement Shortcomings: the discrepancy depends on a maximizing pair of hypotheses. the DM algorithm is too conservative. Ideas: a finer, hypothesis-dependent quantity: the generalized discrepancy. reweighting depending on the hypothesis. (Cortes, MM, and Muñoz (2014)) page 38

39 Choose $Q_h$ Ideally: choose $Q_h$ such that the objectives are uniformly close: $\lambda\|h\|_K^2 + L_{Q_h}(h, f_Q) \approx \lambda\|h\|_K^2 + L_{\widehat{P}}(h, f_P)$, i.e. $Q_h = \mathrm{argmin}_q \big|L_q(h, f_Q) - L_{\widehat{P}}(h, f_P)\big|$. (Cortes, MM, and Muñoz (2014)) Using a convex surrogate set $H''$: $Q_h = \mathrm{argmin}_q \max_{h'' \in H''} \big|L_q(h, f_Q) - L_{\widehat{P}}(h, h'')\big|$. page 39

40 Optimization (Cortes, MM, and Muñoz (2014)) $L_{Q_h}(h, f_Q) = \mathrm{argmin}_{l \in \{L_q(h, f_Q) : q \in F(S_X, \mathbb{R})\}} \max_{h'' \in H''} \big|l - L_{\widehat{P}}(h, h'')\big| = \mathrm{argmin}_{l \in \mathbb{R}} \max_{h'' \in H''} \big|l - L_{\widehat{P}}(h, h'')\big| = \frac{1}{2}\Big[\max_{h'' \in H''} L_{\widehat{P}}(h, h'') + \min_{h'' \in H''} L_{\widehat{P}}(h, h'')\Big]$. Convex optimization problem (loss jointly convex): $\min_h\ \lambda\|h\|_K^2 + \frac{1}{2}\Big[\max_{h'' \in H''} L_{\widehat{P}}(h, h'') + \min_{h'' \in H''} L_{\widehat{P}}(h, h'')\Big]$. page 40

41 Convex Surrogate Hyp. Set Choice of $H''$ among balls $B(r) = \{h'' \in H : L_q(h'', f_Q) \le r\}$. (Cortes, MM, and Muñoz (2014)) Generalization bound proven to be more favorable than DM for some choices of the radius $r$. Radius $r$ chosen via cross-validation using a small amount of labeled data from the target. Further improvement of empirical results. page 41

42 Conclusion Theory of adaptation based on discrepancy: key term in analysis of adaptation and drifting. discrepancy minimization algorithm DM. compares favorably to other adaptation algorithms in experiments. Generalized discrepancy: extension to hypothesis-dependent reweighting. convex optimization problem. further empirical improvements. page 42

43 Outline Domain adaptation. Multiple-source domain adaptation. page 43

44 Problem Formulation Given distributions $D_1, D_2, \ldots, D_k$ and corresponding hypotheses $h_1, h_2, \ldots, h_k$. Notation: $f$ unknown target function; $L(D_i, h_i, f) = \mathbb{E}_{x \sim D_i}[L(h_i(x), f(x))]$. each hypothesis performs well in its domain: $L(D_i, h_i, f) \le \epsilon$. Loss $L$ assumed non-negative, bounded, convex and continuous. page 44

45 Problem Formulation The unknown target distribution $D_T$ is a mixture of the input distributions: $D_T(x) = \sum_{i=1}^k \lambda_i D_i(x)$. Task: choose a hypothesis combination that performs well on the target distribution. convex combination rule: $h_z(x) = \sum_{i=1}^k z_i h_i(x)$. distribution weighted combination rule: $h_z(x) = \sum_{i=1}^k \frac{z_i D_i(x)}{\sum_{j=1}^k z_j D_j(x)}\, h_i(x)$. page 45
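Both rules are straightforward to implement; the following numpy sketch (ours, for illustration only) evaluates them at a set of points given per-domain densities and base predictions, and includes the smoothing parameter eta used later in the talk to handle points of vanishing density.

import numpy as np

def convex_combination(H, z):
    # H: (k, n) base predictions h_i(x); z: (k,) mixture weights
    return z @ H

def distribution_weighted(H, D, z, eta=0.0):
    # D: (k, n) densities D_i(x); with eta > 0 this is the smoothed rule h_z^eta
    k = D.shape[0]
    num = z[:, None] * D + eta / k                  # z_i D_i(x) + eta/k
    weights = num / num.sum(axis=0, keepdims=True)  # normalize over the domains i
    return (weights * H).sum(axis=0)

H = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])
D = np.array([[0.4, 0.4, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]])
z = np.array([0.5, 0.5])
print(convex_combination(H, z))
print(distribution_weighted(H, D, z))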

46 Known Target Distribution For some distributions, any convex combination performs poorly. [Figure: distribution weights of $D_T$, $D_0$, $D_1$ and the outputs of $f$, $h_0$, $h_1$ over two regions $a$ and $b$.] base hypotheses have no error within their domain. any convex combination has error 1/2! page 46

47 Main Results Thus, although convex combinations seem natural, they can perform very poorly. We will show that distribution weighted combinations define the right combination rule. There exists a single robust distribution weighted hypothesis that does well for any target mixture: $\forall f,\ \exists z,\ \forall \lambda,\ L(D_\lambda, h_z, f) \le \epsilon$. page 47

48 Known Target Distribution If the mixture distribution is known, the distribution weighted rule will always do well. Choose $z = \lambda$: $h_\lambda(x) = \sum_{i=1}^k \frac{\lambda_i D_i(x)}{\sum_{j=1}^k \lambda_j D_j(x)}\, h_i(x)$. Proof: $L(D_T, h_\lambda, f) = \sum_{x \in X} L(h_\lambda(x), f(x))\, D_T(x) \le \sum_{x \in X} \sum_{i=1}^k \frac{\lambda_i D_i(x)}{D_T(x)}\, L(h_i(x), f(x))\, D_T(x) = \sum_{i=1}^k \lambda_i L(D_i, h_i, f) \le \sum_{i=1}^k \lambda_i\, \epsilon = \epsilon$, where the first inequality holds by convexity of the loss. page 48

49 Unknown Target Mixture Zero-sum game: NATURE: select a target distribution $D_i$. LEARNER: select a $z$, i.e. a distribution weighted hypothesis $h_z$. Payoff: $L(D_i, h_z, f)$. Already shown: the game value is at most $\epsilon$. Minimax theorem (modulo discretization of $z$): there exists a mixture of distribution weighted hypotheses $\sum_j \mu_j h_{z_j}$ that does well for any distribution mixture. page 49

50 Balancing Losses Brouwer's Fixed Point theorem: for any compact, convex, non-empty set $A$ and any continuous function $f \colon A \to A$, there exists $x \in A$ such that $f(x) = x$. Notation: $L_i^z := L(D_i, h_z, f)$. Define the mapping $\phi$ by: $[\phi(z)]_i = \frac{z_i L_i^z}{\sum_j z_j L_j^z}$. By the fixed point theorem (modulo continuity), there exists $z$ such that: $\forall i,\ z_i = \frac{z_i L_i^z}{\sum_j z_j L_j^z}$, thus $\forall i,\ L_i^z = \sum_j z_j L_j^z =: \gamma$. page 50

51 Bounding Loss For the fixed point $z$, $L(D_z, h_z, f) = \sum_{x \in X} L(h_z(x), f(x)) \sum_{i=1}^k z_i D_i(x) = \sum_{i=1}^k z_i \sum_{x \in X} D_i(x)\, L(h_z(x), f(x)) = \sum_{i=1}^k z_i L_i^z = \sum_{i=1}^k z_i\, \gamma = \gamma$. Also, by convexity, $L(D_z, h_z, f) = \sum_{x \in X} L(h_z(x), f(x))\, D_z(x) \le \sum_{x \in X} \sum_{i=1}^k \frac{z_i D_i(x)}{D_z(x)}\, L(h_i(x), f(x))\, D_z(x) = \sum_{i=1}^k z_i L(D_i, h_i, f) \le \epsilon$. page 51

52 Bounding Loss Thus, $L(D_i, h_z, f) = \gamma \le \epsilon$ for every $i$, and for any mixture $D_\lambda$, $L(D_\lambda, h_z, f) = \sum_{i=1}^k \lambda_i L(D_i, h_z, f) \le \sum_{i=1}^k \lambda_i\, \epsilon = \epsilon$. [Figure: a single hypothesis $h_z$ performing well for all domains $D_1, D_2, \ldots, D_k$ and target $f$.] page 52

53 Details To deal with non-continuity, refine the hypotheses: $h_z^\eta(x) = \sum_{i=1}^k \frac{z_i D_i(x) + \eta/k}{\sum_{j=1}^k z_j D_j(x) + \eta}\, h_i(x)$. Theorem: for any target function $f$ and any $\delta > 0$, $\exists \eta > 0, z \colon \forall \lambda,\ L(D_\lambda, h_z^\eta, f) \le \epsilon + \delta$. If the loss obeys the triangle inequality, the result holds for all admissible target functions: $\forall \delta > 0,\ \exists z, \eta > 0 \colon \forall \lambda,\ \forall f \in F,\ L(D_\lambda, h_z^\eta, f) \le 3\epsilon + \delta$. page 53

54 A Simple Algorithm A simple constructive algorithm: choose $z = u$ with uniform weights. Then, for any mixture $D_\lambda$, $L(D_\lambda, h_u, f) = \sum_x D_\lambda(x)\, L\Big(\sum_{i=1}^k \frac{D_i(x)}{\sum_{j=1}^k D_j(x)}\, h_i(x), f(x)\Big) \le \sum_x \frac{\sum_{m=1}^k \lambda_m D_m(x)}{\sum_{j=1}^k D_j(x)} \sum_{i=1}^k D_i(x)\, L(h_i(x), f(x)) \le \sum_x \sum_{i=1}^k D_i(x)\, L(h_i(x), f(x)) = \sum_{i=1}^k L(D_i, h_i, f) \le k\epsilon$. page 54
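This $k\epsilon$ guarantee is easy to check numerically; the toy sketch below (ours, not from the slides) builds two discrete source domains with near-accurate base hypotheses, forms the uniform distribution weighted rule, and verifies the bound for several mixtures under the squared loss.

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
D = rng.dirichlet(np.ones(n), size=k)       # densities D_i over n points
f = rng.normal(size=n)                      # target labels
H = f + 0.05 * rng.normal(size=(k, n))      # base hypotheses with small in-domain error

w = D / D.sum(axis=0, keepdims=True)        # D_i(x) / sum_j D_j(x)
h_u = (w * H).sum(axis=0)                   # uniform distribution weighted rule

eps = max((D[i] * (H[i] - f) ** 2).sum() for i in range(k))
for lam in np.linspace(0.0, 1.0, 5):
    D_lam = lam * D[0] + (1 - lam) * D[1]
    assert (D_lam * (h_u - f) ** 2).sum() <= k * eps + 1e-12
print("L(D_lambda, h_u, f) <= k * eps holds on this example")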

55 Preliminary Empirical Results Sentiment Analysis - given a product review (text string), predict a rating (between 1.0 and 5.0). 4 Domains: Books, DVDs, Electronics and Kitchen Appliances. Base hypotheses are trained within each domain (Support Vector Regression). We are not given the distributions. We model each distribution using a bag of words model. We then test the distribution combination rule on known target mixture domains. page 55

56 Empirical Results [Bar chart: MSE of the uniform mixture over the 4 domains, comparing in-domain and out-of-domain performance.] page 56

57 Empirical Results [Plot: MSE as a function of $\alpha$ for the target mixture $\alpha\,\mathrm{book} + (1 - \alpha)\,\mathrm{kitchen}$, comparing the distribution weighted and linear combination rules with the book and kitchen base hypotheses.] page 57

58 Conclusion Formulation of the multiple source adaptation problem. Theoretical analysis for mixture distributions. Efficient algorithm for finding distribution weighted combination hypothesis? Beyond mixture distributions? page 58

59 Rényi Divergences Definition: for $\alpha \ge 0$, $D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \sum_x P(x) \Big[\frac{P(x)}{Q(x)}\Big]^{\alpha - 1}$. $\alpha = 1$: coincides with the relative entropy. $\alpha = 2$: logarithm of the expected probability ratio, $D_2(P \,\|\, Q) = \log \mathbb{E}_{x \sim P}\Big[\frac{P(x)}{Q(x)}\Big]$. $\alpha = +\infty$: logarithm of the maximum probability ratio, $D_\infty(P \,\|\, Q) = \log \sup_x \frac{P(x)}{Q(x)}$. page 59
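For discrete distributions the definition translates directly into a few lines of numpy; the helper below (an illustration, not from the slides) also handles the $\alpha = 1$ and $\alpha = \infty$ limits.

import numpy as np

def renyi_divergence(P, Q, alpha):
    # D_alpha(P || Q) for discrete distributions; terms with P(x) = 0 vanish
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    if alpha == 1:
        return np.sum(P[mask] * np.log(P[mask] / Q[mask]))      # relative entropy
    if alpha == np.inf:
        return np.log(np.max(P[mask] / Q[mask]))                # log of max ratio
    ratio = (P[mask] / Q[mask]) ** (alpha - 1)
    return np.log(np.sum(P[mask] * ratio)) / (alpha - 1)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
for a in (0.5, 1, 2, np.inf):
    print(a, renyi_divergence(P, Q, a))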

60 Extensions - Arbitrary Target (Mansour, MM, and Rostami, 2009) Theorem: for any $\delta > 0$ and $\alpha > 1$, $\exists \eta, z \colon \forall P$, $L(P, h_z^\eta, f) \le \big[d_\alpha(P \,\|\, Q)\,(\epsilon + \delta)\big]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$, with $Q = \sum_{i=1}^k \lambda_i D_i$: divergence measured in terms of the (exponentiated) Rényi divergence, $d_\alpha(P, Q) = \Big[\sum_x \frac{P^\alpha(x)}{Q^{\alpha - 1}(x)}\Big]^{\frac{1}{\alpha - 1}}$. page 60

61 Proof By Hölder's inequality, for any hypothesis $h$, $L(P, h, f) = \sum_x P(x)\, L(h(x), f(x)) = \sum_x \frac{P(x)}{Q^{\frac{\alpha-1}{\alpha}}(x)}\, Q^{\frac{\alpha-1}{\alpha}}(x)\, L(h(x), f(x)) \le \Big[\sum_x \frac{P^\alpha(x)}{Q^{\alpha-1}(x)}\Big]^{\frac{1}{\alpha}} \Big[\sum_x Q(x)\, L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\Big]^{\frac{\alpha-1}{\alpha}} = \big(d_\alpha(P \,\|\, Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\mathbb{E}_{x \sim Q}\big[L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} = \big(d_\alpha(P \,\|\, Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\mathbb{E}_{x \sim Q}\big[L(h(x), f(x))\, L^{\frac{1}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} \le \big(d_\alpha(P \,\|\, Q)\big)^{\frac{\alpha-1}{\alpha}} \big[L(Q, h, f)\, M^{\frac{1}{\alpha-1}}\big]^{\frac{\alpha-1}{\alpha}}$. page 61

62 Other Extensions Approximate (estimated) distributions: similar results shown, depending on the divergence between the true and estimated distributions. Different source target functions $f_i$: similar results when the target functions are close to $f$ on the target distribution. (Mansour, MM, and Rostami, 2009) page 62

63 References
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS 2006.
S. Ben-David, T. Lu, T. Luu, and D. Pal. Impossibility theorems for domain adaptation. Journal of Machine Learning Research - Proceedings Track, 9, 2010.
S. Bickel, M. Bruckner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10.
J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS 2007.
Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS 2010.
Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519.
page 63

64 References
Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Proceedings of ALT 2008.
Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from data of variable quality. In Proceedings of NIPS, 2005.
Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. In Proceedings of NIPS, 2006.
L. Devroye, L. Gyorfi, and G. Lugosi. A probabilistic theory of pattern recognition. Springer, 1996.
M. Dredze, J. Blitzer, P. P. Talukdar, K. Ganchev, J. Graca, and F. Pereira. Frustratingly hard domain adaptation for parsing. In CoNLL.
J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19.
page 64

65 References
D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of VLDB, 2004.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: learning bounds and algorithms. In COLT 2009.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in NIPS, 2008.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of UAI 2009.
Mehryar Mohri and Andres Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In Proceedings of ALT, volume 7568.
page 65

66 References
H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 2000.
M. Sugiyama, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS.
M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60.
page 66
