Learning with Imperfect Data

1 Mehryar Mohri, Courant Institute and Google.
Joint work with: Yishay Mansour (Tel-Aviv & Google) and Afshin Rostamizadeh (Courant Institute).

2 Standard Learning Assumptions
IID assumption. Same distribution for training and test. Distributions fixed over time.

3 Modern Large-Scale Data Sets
Real-world applications: sample points are not drawn IID; the training sample is biased; training points have uncertain labels; there are multiple training sources; the distribution may drift with time.
These problems must be addressed for learning to be effective.

4 Domain Adaptation - Problem
Input: labeled data from the source domain; unlabeled data from the target domain.
Problem: use labeled and unlabeled data to derive a hypothesis h with good performance on the target domain.
Thus, a harder generalization problem than the standard learning problem!

5 Domain Adaptation - Examples
Sentiment analysis: appraisal information for some domains, e.g., movies, books, music, restaurants, but no labeled information for travel.
Language modeling, part-of-speech tagging, parsing. Speech recognition. Computer vision.

6 Related Work
Single-source adaptation:
language modeling, probabilistic parsers, maxent models: source domain used to define a prior.
relation between adaptation and the $d_A$ distance [Ben-David et al. (2006) and Blitzer et al. (2007)].
Multiple-source:
same input distribution, but different labels [Crammer et al. (2005, 2006)].
theoretical analysis and method for multiple-source adaptation [Mansour et al. (2008)].

7 This Talk
Domain adaptation problem. Discrepancy distance. Theoretical guarantees. Algorithm. Experiments.

8 Learning Set-up
Distributions: source $Q$, target $P$.
Target function(s): $f$, or $f_Q$ and $f_P$.
Input: labeled sample drawn from $Q$, unlabeled sample drawn from $P$.
Problem: find a hypothesis $h$ with small expected loss with respect to distribution $P$,
$$L_P(h, f) = \mathop{\mathrm{E}}_{x \sim P}\big[L(h(x), f(x))\big].$$

9 Distribution Mismatch
[Figure: distributions $Q$ and $P$.]
Which distance should we use to compare these distributions?

10 Simple Analysis
Proposition: assume that the loss $L$ is bounded by $M$, then
$$|L_Q(h, f) - L_P(h, f)| \le M\, l_1(Q, P).$$
Proof:
$$|L_Q(h, f) - L_P(h, f)| = \Big|\mathop{\mathrm{E}}_{Q}\big[L(h(x), f(x))\big] - \mathop{\mathrm{E}}_{P}\big[L(h(x), f(x))\big]\Big| = \Big|\sum_x \big(Q(x) - P(x)\big)\, L(h(x), f(x))\Big| \le M \sum_x |Q(x) - P(x)|.$$
But, is this bound informative?
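The proposition can be checked numerically on a small finite domain. The sketch below is only an illustration: the distributions `Q` and `P`, the hypothesis `h`, and the target `f` are hypothetical toy choices, and the loss is the 0/1 loss, so $M = 1$.

```python
def expected_loss(dist, h, f, loss):
    """Expected loss of h against f under a finite distribution {x: prob}."""
    return sum(p * loss(h(x), f(x)) for x, p in dist.items())


def l1_distance(q, p):
    """l1 distance between two finite distributions given as dicts."""
    support = set(q) | set(p)
    return sum(abs(q.get(x, 0.0) - p.get(x, 0.0)) for x in support)


# Hypothetical toy data on the domain {0, 1, 2, 3}.
Q = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}


def h(x):
    return 1 if x >= 2 else 0  # hypothesis


def f(x):
    return 1 if x >= 1 else 0  # target function


def zero_one(y1, y2):
    return float(y1 != y2)  # 0/1 loss, bounded by M = 1


M = 1.0
gap = abs(expected_loss(Q, h, f, zero_one) - expected_loss(P, h, f, zero_one))
bound = M * l1_distance(Q, P)
print(gap, bound)  # gap is about 0.1, bound is about 0.8
```

Here h and f disagree only at x = 1, so the true gap is |Q(1) - P(1)|, far smaller than the l1 bound. This is the sense in which the bound can be uninformative.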

11 Example - 0/1 Loss
[Figure: region $a$ on which $h$ and $f$ disagree.]
$$|L_Q(h, f) - L_P(h, f)| = |Q(a) - P(a)|$$

12 $d_A$ distance
Definition:
$$d_A(Q_1, Q_2) = \sup_{a \in A} |Q_1(a) - Q_2(a)|,$$
where $A$ is a set of regions or subsets of $X$ [Devroye et al. (1996), Kifer et al. (2004), Ben-David et al. (2007), Blitzer et al. (2007)].
For the 0/1 loss, the natural choice is the set of all possible disagreement regions:
$$A = H \Delta H = \{\, |h' - h| : h, h' \in H \,\}.$$

13 Discrepancy Distance
Definition:
$$\mathrm{disc}(Q_1, Q_2) = \max_{h, h' \in H} \big|L_{Q_1}(h', h) - L_{Q_2}(h', h)\big|.$$
Relationship with discrepancy in combinatorial contexts [Chazelle (2000)].
$d_A$ is a special case: 0/1 loss.
Helps compare distributions for other losses, e.g., hinge loss, $L_p$ loss.
Symmetric, verifies the triangle inequality, in general not a distance.
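For a concrete feel of the definition, the following sketch computes the discrepancy between two weighted samples for one simple hypothesis set. It assumes (this is not from the slides) that H is the set of 1D threshold classifiers x -> 1[x >= t] and that the loss is the 0/1 loss, so two thresholds lo < hi disagree exactly on the interval [lo, hi). The function names and the toy samples are hypothetical.

```python
from itertools import combinations


def disagreement_mass(sample, lo, hi):
    """Total weight of points x with lo <= x < hi, where h_lo and h_hi disagree."""
    return sum(w for x, w in sample if lo <= x < hi)


def discrepancy_thresholds(p_hat, q_hat):
    """Discrepancy of two weighted samples for H = {x -> 1[x >= t]}, 0/1 loss."""
    points = sorted({x for x, _ in p_hat} | {x for x, _ in q_hat})
    # Candidate thresholds: every sample point plus one value above all of them.
    thresholds = points + [points[-1] + 1.0]
    best = 0.0
    for lo, hi in combinations(thresholds, 2):
        diff = abs(disagreement_mass(p_hat, lo, hi) - disagreement_mass(q_hat, lo, hi))
        best = max(best, diff)
    return best


# Hypothetical weighted samples given as (point, weight) pairs.
p_hat = [(0.0, 0.5), (1.0, 0.5)]
q_hat = [(0.0, 0.25), (1.0, 0.25), (2.0, 0.5)]
d = discrepancy_thresholds(p_hat, q_hat)
print(d)  # 0.5
```

The maximum is reached on the interval [2, 3), which carries weight 0.5 under q_hat and none under p_hat.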

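14 Discrepancy - Properties
Theorem: the discrepancy distance can be estimated from finite samples for $H$ with finite VC dimension.
For the $L_q$ loss, $L_q(y, y') = |y - y'|^q$, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\mathrm{disc}(P, Q) \le \mathrm{disc}(\widehat{P}, \widehat{Q}) + 4q\big(\widehat{\mathfrak{R}}_S(H) + \widehat{\mathfrak{R}}_T(H)\big) + 3M\left(\sqrt{\frac{\log\frac{4}{\delta}}{2m}} + \sqrt{\frac{\log\frac{4}{\delta}}{2n}}\right).$$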
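To see how quickly the last term of the bound shrinks, the sketch below evaluates $3M\big(\sqrt{\log(4/\delta)/(2m)} + \sqrt{\log(4/\delta)/(2n)}\big)$ for given values. It assumes that m and n are the sizes of the two samples; the function name and the numbers used are illustrative only, and the Rademacher complexity terms are not computed.

```python
import math


def confidence_term(M, delta, m, n):
    """3M * (sqrt(log(4/delta) / (2m)) + sqrt(log(4/delta) / (2n)))."""
    log_term = math.log(4.0 / delta)
    return 3.0 * M * (math.sqrt(log_term / (2 * m)) + math.sqrt(log_term / (2 * n)))


# Illustrative values: the term decays like 1/sqrt(sample size).
small = confidence_term(M=1.0, delta=0.05, m=100, n=100)
large = confidence_term(M=1.0, delta=0.05, m=10000, n=10000)
print(small, large)  # the second value is ten times smaller than the first
```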

15 This Talk
Domain adaptation problem. Discrepancy distance. Theoretical guarantees. Algorithm. Experiments.

16 Theoretical Guarantees
Two types of questions:
difference between the average loss of a hypothesis $h$ on $Q$ versus $P$?
difference of loss between a hypothesis $h$ trained on $Q$ and $h'$ trained on $P$?

17 Generalization Bound
Notation:
$$L_Q(h_Q^*, f_Q) = \min_{h \in H} L_Q(h, f_Q), \qquad L_P(h_P^*, f_P) = \min_{h \in H} L_P(h, f_P).$$
Theorem: assume that $L$ obeys the triangle inequality, then the following holds:
$$L_P(h, f_P) \le L_Q(h, h_Q^*) + L_P(h_P^*, f_P) + \mathrm{disc}(P, Q) + L_Q(h_Q^*, h_P^*).$$
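One way to obtain this bound uses the triangle inequality twice and the definition of the discrepancy once. The derivation below is a sketch of such an argument, not the authors' written proof. It writes $h_Q^*$ and $h_P^*$ for the minimizers in the notation above and assumes that $h$, $h_Q^*$ and $h_P^*$ all belong to $H$.

```latex
\begin{align*}
L_P(h, f_P)
  &\le L_P(h, h_P^*) + L_P(h_P^*, f_P)
  && \text{(triangle inequality under } P\text{)} \\
  &\le L_Q(h, h_P^*) + \mathrm{disc}(P, Q) + L_P(h_P^*, f_P)
  && \text{(definition of disc, since } h, h_P^* \in H\text{)} \\
  &\le L_Q(h, h_Q^*) + L_Q(h_Q^*, h_P^*) + \mathrm{disc}(P, Q) + L_P(h_P^*, f_P)
  && \text{(triangle inequality under } Q\text{)}
\end{align*}
```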

18 Some Special Cases
When $h^* = h_Q^* = h_P^*$,
$$L_P(h, f_P) \le L_Q(h, h^*) + L_P(h^*, f_P) + \mathrm{disc}(P, Q).$$
When $f_P \in H$ (consistent case),
$$|L_P(h, f_P) - L_Q(h, f_P)| \le \mathrm{disc}_L(Q, P).$$

19 Kernel-Based Reg. Algorithms
Algorithms minimizing the objective function
$$F_{\widehat{Q}}(h) = \lambda \|h\|_K^2 + \widehat{R}_{\widehat{Q}}(h),$$
where $K$ is a positive definite symmetric kernel, $\lambda > 0$ is a trade-off parameter, and $\widehat{R}_{\widehat{Q}}(h)$ is the empirical error of $h$.
Family of algorithms including SVMs, SVR, kernel ridge regression, etc.
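As a minimal instance of this objective, consider kernel ridge regression with the linear kernel K(x, x') = x x' in one dimension. Then h(x) = w x, the norm term is w^2, and the minimizer has a closed form. The sketch below is an illustration under those assumptions, with hypothetical function names and toy data. The per-point weights play the role of the empirical distribution over the training sample.

```python
def weighted_ridge_1d(xs, ys, weights, lam):
    """Minimize lam * w**2 + sum_i weights[i] * (w * xs[i] - ys[i])**2 over w."""
    # Setting the derivative to zero gives w = sum(q x y) / (lam + sum(q x^2)).
    num = sum(q * x * y for x, y, q in zip(xs, ys, weights))
    den = lam + sum(q * x * x for x, q in zip(xs, weights))
    return num / den


def objective(w, xs, ys, weights, lam):
    """Regularized objective: norm penalty plus weighted empirical squared error."""
    return lam * w * w + sum(q * (w * x - y) ** 2 for x, y, q in zip(xs, ys, weights))


# Hypothetical toy data generated by y = 2x, with uniform weights.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
uniform = [1.0 / 3.0] * 3

w = weighted_ridge_1d(xs, ys, uniform, lam=0.1)
print(w)  # slightly below 2 because of the regularization term
```

Changing `weights` changes the empirical distribution the algorithm is trained on, which is exactly the degree of freedom that a reweighting of the training points exploits.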

20 Guarantees for KBR Algorithms
Theorem: let $K$ be a positive definite symmetric kernel with $K(x, x) \le \kappa^2$ for all $x$, and $L$ a loss such that $L(\cdot, y)$ is $\sigma$-Lipschitz. Assume that $f_P \in H$ and that $f_P$ and $f_Q$ coincide on the training sample. Then, for all $x \in X$, $y \in Y$,
$$|L(h'(x), y) - L(h(x), y)| \le \kappa \sigma \sqrt{\frac{\mathrm{disc}(\widehat{P}, \widehat{Q})}{\lambda}}.$$

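21 Guarantees for KBR Algorithms
Theorem: same assumptions, but $f_P$ and $f_Q$ potentially different on the training sample, $H$ bounded by $M$, and $L$ the square loss; then, for all $x \in X$, $y \in Y$,
$$|L(h'(x), y) - L(h(x), y)| \le \frac{2\kappa M}{\lambda}\left(\kappa\delta + \sqrt{\kappa^2\delta^2 + 4\lambda\, \mathrm{disc}_L(\widehat{P}, \widehat{Q})}\right),$$
with $\delta^2 = L_{\widehat{Q}}(f_Q(x), f_P(x)) \ll 1$.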

22 Empirical Discrepancy
The discrepancy distance $\mathrm{disc}(\widehat{P}, \widehat{Q})$ is the critical term in the bounds.
A smaller empirical discrepancy guarantees closeness of the pointwise losses of $h'$ and $h$.
But, can we further reduce the discrepancy?

23 This Talk
Domain adaptation problem. Discrepancy distance. Theoretical guarantees. Algorithm. Experiments.

24 Algorithm - Idea
The training sample is given, but we can search for a new empirical distribution $Q^*$ such that
$$Q^* = \operatorname*{argmin}_{Q' \in \mathcal{Q}} \mathrm{disc}(\widehat{P}, Q'),$$
where $\mathcal{Q}$ is the set of distributions with support $\mathrm{supp}(\widehat{Q})$.
Can be interpreted as reweighting the training points.
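The reweighting idea can be tried on a toy problem by brute force. The sketch below is not the algorithm from the talk: it replaces the optimization by a plain grid search over the simplex. It also assumes 1D threshold classifiers with the 0/1 loss, for which the discrepancy is the largest gap in mass over intervals [lo, hi). All names and data are hypothetical.

```python
from itertools import combinations, product


def disc_thresholds(p_hat, q_hat):
    """Discrepancy for threshold classifiers under the 0/1 loss:
    the largest gap |P(I) - Q(I)| over intervals I = [lo, hi)."""
    points = sorted(set(p_hat) | set(q_hat))
    cuts = points + [points[-1] + 1.0]

    def mass(dist, lo, hi):
        return sum(w for x, w in dist.items() if lo <= x < hi)

    return max(abs(mass(p_hat, lo, hi) - mass(q_hat, lo, hi))
               for lo, hi in combinations(cuts, 2))


def reweight_by_grid_search(p_hat, support, steps=20):
    """Search weight vectors on `support` (a grid on the simplex) for minimal discrepancy."""
    best_q, best_disc = None, float("inf")
    for counts in product(range(steps + 1), repeat=len(support)):
        if sum(counts) != steps:
            continue  # keep only weight vectors that sum to 1
        q = {x: c / steps for x, c in zip(support, counts)}
        d = disc_thresholds(p_hat, q)
        if d < best_disc:
            best_q, best_disc = q, d
    return best_q, best_disc


# Hypothetical toy data: target points {0, 1}, source points {0, 1, 2}.
p_hat = {0.0: 0.5, 1.0: 0.5}
source_points = [0.0, 1.0, 2.0]
uniform_q = {x: 1.0 / 3.0 for x in source_points}

d_uniform = disc_thresholds(p_hat, uniform_q)
q_star, d_star = reweight_by_grid_search(p_hat, source_points)
print(d_uniform, q_star, d_star)
```

In this toy case the search moves all weight off the source point 2.0, which has no counterpart in the target sample, and the discrepancy drops from about 1/3 to 0. A grid search scales exponentially with the number of training points, so it is only useful for illustration.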

25 Case of Halfspaces
[Figure.]

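26 Min-Max Problem
Reformulation:
$$Q^* = \operatorname*{argmin}_{Q' \in \mathcal{Q}} \max_{h, h' \in H} \big|L_{\widehat{P}}(h', h) - L_{Q'}(h', h)\big|.$$
Game theoretical interpretation.
Gives lower bound:
$$\max_{h, h' \in H} \min_{Q' \in \mathcal{Q}} \big|L_{\widehat{P}}(h', h) - L_{Q'}(h', h)\big| \le \min_{Q' \in \mathcal{Q}} \max_{h, h' \in H} \big|L_{\widehat{P}}(h', h) - L_{Q'}(h', h)\big|.$$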

27 Classification - 0/1 Loss
Problem:
$$\min_{Q'} \max_{a \in H \Delta H} |Q'(a) - \widehat{P}(a)|$$
$$\text{subject to } \forall x \in S_Q,\ Q'(x) \ge 0 \ \wedge \sum_{x \in S_Q} Q'(x) = 1.$$

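28 Classification - 0/1 Loss
Linear program (LP):
$$\min_{Q'} \delta$$
$$\text{subject to } \forall a \in H \Delta H,\ Q'(a) - \widehat{P}(a) \le \delta$$
$$\forall a \in H \Delta H,\ \widehat{P}(a) - Q'(a) \le \delta$$
$$\forall x \in S_Q,\ Q'(x) \ge 0 \ \wedge \sum_{x \in S_Q} Q'(x) = 1.$$
No. of constraints bounded by the shattering coefficient $\Pi_{H \Delta H}(m_0 + n_0)$.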

29 Algorithm - 1D
[Figure.]

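30 Regression - L2 Loss
Problem:
$$\min_{Q' \in \mathcal{Q}} \max_{h, h' \in H} \Big|\mathop{\mathrm{E}}_{\widehat{P}}\big[(h'(x) - h(x))^2\big] - \mathop{\mathrm{E}}_{Q'}\big[(h'(x) - h(x))^2\big]\Big|.$$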

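31 Regression - L2 Loss
Semi-definite program (SDP): linear hypotheses.
$$\min_{z, \lambda} \lambda$$
$$\text{subject to } \lambda I - M(z) \succeq 0,\quad \lambda I + M(z) \succeq 0,\quad \mathbf{1}^\top z = 1,\quad z \ge 0,$$
where the matrix $M(z)$ is defined by
$$M(z) = \sum_{x \in S} \widehat{P}(x)\, x x^\top - \sum_{i=1}^{m_0} z_i\, s_i s_i^\top,$$
with $s_i$ the elements of $\mathrm{supp}(\widehat{Q})$.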
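The link between the min-max problem for the L2 loss and these semidefinite constraints can be sketched as follows. The derivation assumes linear hypotheses $h(x) = w^\top x$ and a norm bound $\|w' - w\| \le \Lambda$ on the difference of two hypotheses. The constant $\Lambda$ is introduced here for illustration and does not appear on the slides. The vector $z$ holds the weights that $Q'$ assigns to the points $s_i$.

```latex
\begin{align*}
% Write u = w' - w, so that h'(x) - h(x) = u^\top x.
\mathop{\mathrm{E}}_{\widehat{P}}\big[(u^\top x)^2\big] - \mathop{\mathrm{E}}_{Q'}\big[(u^\top x)^2\big]
  &= u^\top \Big( \sum_{x \in S} \widehat{P}(x)\, x x^\top - \sum_{i=1}^{m_0} z_i\, s_i s_i^\top \Big) u
   = u^\top M(z)\, u, \\
\max_{\|u\| \le \Lambda} \big| u^\top M(z)\, u \big|
  &= \Lambda^2\, \| M(z) \|_2
  \qquad \text{(spectral norm, since } M(z) \text{ is symmetric)}, \\
\| M(z) \|_2 \le \lambda
  &\iff -\lambda I \preceq M(z) \preceq \lambda I
  \iff \lambda I - M(z) \succeq 0 \ \text{ and } \ \lambda I + M(z) \succeq 0 .
\end{align*}
```

Minimizing $\lambda$ over $z$ in the simplex under these two constraints therefore minimizes the spectral norm of $M(z)$, and with it the discrepancy up to the factor $\Lambda^2$.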

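32 Regression - L2 Loss
SDP: generalization to $H$ RKHS for some kernel $K$.
$$\min_{z, \lambda} \lambda$$
$$\text{subject to } \lambda I - M(z) \succeq 0,\quad \lambda I + M(z) \succeq 0,\quad \mathbf{1}^\top z = 1,\quad z \ge 0,$$
with:
$$M(z) = M_0 - \sum_{i=1}^{m_0} z_i M_i,$$
$$M_0 = K^{1/2}\, \mathrm{diag}\big(\widehat{P}(s_1), \ldots, \widehat{P}(s_{p_0})\big)\, K^{1/2},$$
$$M_i = K^{1/2} I_i K^{1/2}.$$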

33 This Talk
Domain adaptation problem. Discrepancy distance. Theoretical guarantees. Algorithm. Experiments.

34 Experiments
Classification: $Q$ and $P$ Gaussians. $H$: halfspaces. $f$: interval [-1, +1].
[Plot: curves "w/ min disc" and "w/ orig disc"; x-axis: # Training Points.]

35 Experiments
Regression:
[Plot: legend entries Q, Q, P; x-axis: # Training Points.]
SDP solved in about 15s using SeDuMi on a 3GHz CPU with 2GB memory.

36 Conclusion
Discrepancy distance: appears as the right measure of difference of distributions for adaptation.
Theoretical analysis: generalization bounds and strong guarantees for a large class of algorithms.
Algorithm: discrepancy minimization algorithms for other loss functions, more efficient large-scale algorithms.

Domain Adaptation for Regression Corinna Cortes Google Research corinna@google.com Mehryar Mohri Courant Institute and Google mohri@cims.nyu.edu Motivation Applications: distinct training and test distributions.