Domain Adaptation for Regression



1 Domain Adaptation for Regression
Corinna Cortes, Google Research
Mehryar Mohri, Courant Institute and Google

2 Motivation
Applications: distinct training and test distributions.
Sentiment analysis: appraisal information for some domains, e.g., movies, books, music, restaurants, but no labels for travel.
Language modeling, part-of-speech tagging.
Statistical parsing.
Speech recognition.
Computer vision.
Solution critical for applications.
This talk: regression problems.

3 Domain Adaptation Problem
Distributions: source $Q$, target $P$.
Target function(s): $f_Q$ and $f_P$, or just $f$.
Input: training sample drawn from $Q$, unlabeled sample drawn from $P$.
Problem: find hypothesis $h$ with small expected loss with respect to distribution $P$,
$L_P(h, f_P) = \mathbb{E}_{x \sim P}\big[L(h(x), f_P(x))\big]$.

4 Distribution Mismatch
(Figure: source distribution $Q$ and target distribution $P$.)
Which distance should we use to compare these distributions?

5 Definition: Discrepancy Distance
$\mathrm{disc}(Q_1, Q_2) = \max_{h, h' \in H} \big| L_{Q_1}(h', h) - L_{Q_2}(h', h) \big|$
Symmetric, satisfies the triangle inequality, in general not a distance.
Helps compare distributions for arbitrary losses, e.g. hinge loss or $L_p$ loss.
Can be estimated from finite samples, with Rademacher complexity bounds.
(Mansour, MM, Rostami, 2009).

6 Previous Work
(Ben-David et al., NIPS 2006) & (Blitzer et al., NIPS 2007): bounds for binary classification based on the $d_A$ distance and a $\lambda_H$ term (cannot be estimated).
(Mansour, MM, Rostami, COLT 2009): learning bounds and analysis for general loss functions. Based on discrepancy and optimal hypotheses. Favorable under plausible assumptions. Pointwise loss guarantees for kernel algorithms.
(Ben-David et al., AISTATS 2010): series of negative results for adaptation in binary classification.

7 Theoretical Guarantees
Two types of questions:
Difference between the average loss of a hypothesis on $Q$ versus $P$?
Difference of loss between hypothesis $h$ obtained when training on $(\widehat{Q}, f_Q)$ versus hypothesis $h'$ obtained when training on $(\widehat{P}, f_P)$?

8 Kernel-Based Reg. (KBR) Algorithms
Algorithms minimizing the objective function
$F_{\widehat{Q}}(h) = \lambda \|h\|_K^2 + R_{\widehat{Q}}(h)$,
where $K$ is a PDS kernel, $\lambda > 0$ is a trade-off parameter, and $R_{\widehat{Q}}(h)$ is the empirical error of $h$.
Family of algorithms including SVM, SVR, kernel ridge regression, etc.

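9 Guarantees for KBR Algorithms
Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and $L$ a loss function such that $L(\cdot, y)$ is $\mu$-Lipschitz. Assume that $f_P \in H$. Then, for all $(x, y) \in X \times Y$,
$\big| L(h'(x), y) - L(h(x), y) \big| \le \mu R \sqrt{\dfrac{\mathrm{disc}(\widehat{P}, \widehat{Q}) + \mu \eta}{\lambda}}$,
where $\eta = \max\{ L(f_Q(x), f_P(x)) : x \in \mathrm{supp}(\widehat{Q}) \}$.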
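To get a feel for how the terms of this guarantee trade off, the right-hand side can be evaluated numerically. The sketch below is only an illustration: the function name `kbr_pointwise_bound` and the sample values are assumptions, not part of the talk.

```python
import math


def kbr_pointwise_bound(disc, eta, mu, R, lam):
    """Right-hand side of the pointwise guarantee: mu * R * sqrt((disc + mu * eta) / lam).

    disc : empirical discrepancy between target and source samples
    eta  : largest loss between the source and target labeling functions
           on the support of the source sample
    mu   : Lipschitz constant of the loss in its first argument
    R    : bound with K(x, x) <= R^2
    lam  : regularization parameter (must be positive)
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    return mu * R * math.sqrt((disc + mu * eta) / lam)


# Hypothetical values, chosen only to exercise the formula.
bound = kbr_pointwise_bound(disc=0.03, eta=0.01, mu=1.0, R=1.0, lam=1.0)
```

The bound vanishes when both the discrepancy and eta are zero, and for fixed data it decreases as the regularization parameter grows.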

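10 Adaptation Algorithm
Search for a new empirical distribution $q^*$ with the same support:
$q^* = \operatorname{argmin}_{\mathrm{supp}(q) \subseteq \mathrm{supp}(\widehat{Q})} \mathrm{disc}(\widehat{P}, q)$.
Solve the modified KBR problem:
$\min_h F_{q^*}(h) = \frac{1}{m} \sum_{i=1}^{m} q^*(x_i) L(h(x_i), y_i) + \lambda \|h\|_K^2$.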
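For the squared loss, the modified KBR problem has a closed-form solution, which may make the role of the reweighting more concrete. The following is a minimal sketch under stated assumptions: the loss is taken to be the squared loss, the hypothesis is written as a kernel expansion with coefficients `alpha`, and the function names and the Gaussian kernel are illustrative choices, not the authors' implementation. Setting the gradient of the objective to zero gives the linear system (D K + lambda m I) alpha = D y, where D is the diagonal matrix of weights.

```python
import numpy as np


def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def weighted_krr_fit(K, y, q, lam):
    """Minimize (1/m) sum_i q_i (h(x_i) - y_i)^2 + lam * ||h||_K^2
    over h = sum_j alpha_j K(x_j, .), and return alpha."""
    m = len(y)
    D = np.diag(q)
    # Stationarity condition: (D K + lam * m * I) alpha = D y.
    return np.linalg.solve(D @ K + lam * m * np.eye(m), D @ y)


def weighted_krr_objective(alpha, K, y, q, lam):
    """Value of the weighted, regularized objective at alpha."""
    resid = K @ alpha - y
    return (q * resid ** 2).sum() / len(y) + lam * alpha @ K @ alpha


def weighted_krr_predict(alpha, X_train, X_new, gamma=1.0):
    """Evaluate the fitted hypothesis at new points."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha
```

Points with weight zero drop out of the data-fitting term entirely, so the fitted hypothesis is driven by the source points that the reweighting considers representative of the target.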

11 Discrepancy Min. - Input space
For L2 loss and $H = \{x \mapsto w^\top x : \|w\| \le \Lambda\}$, can be cast as an SDP (Mansour, MM, Rostami, COLT 2009):
minimize $\|M(z)\|_2$
subject to
$M(z) = M_0 - \sum_{i=1}^{m} z_i M_i$
$M_0 = \sum_{j=m+1}^{q} \widehat{P}(x_j)\, x_j x_j^\top$
$M_i = x_i x_i^\top, \quad i \in [1, m]$
$\|z\|_1 = 1, \quad z \ge 0$.
What about if we want to use kernels?
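The objective of this SDP is cheap to evaluate for a fixed weight vector, which is useful for checking candidate solutions. For the squared loss and norm-bounded linear hypotheses, the difference of two hypotheses ranges over all vectors of norm at most 2Λ, so the discrepancy equals 4Λ² times the spectral norm of M(z). The sketch below only evaluates that norm; it does not solve the SDP, and the function names and the default uniform target weights are assumptions made for the example.

```python
import numpy as np


def second_moment(X, weights):
    """Weighted second-moment matrix: sum_i weights[i] * x_i x_i^T."""
    return (X * weights[:, None]).T @ X


def discrepancy_objective(z, X_src, X_tgt, p_tgt=None):
    """Spectral norm of M(z) = M_0 - sum_i z_i x_i x_i^T.

    z     : weights on the source points (non-negative, summing to one)
    X_src : source points, one per row
    X_tgt : target points, one per row
    p_tgt : target empirical distribution (uniform if not given)
    """
    if p_tgt is None:
        p_tgt = np.full(len(X_tgt), 1.0 / len(X_tgt))
    M_z = second_moment(X_tgt, p_tgt) - second_moment(X_src, z)
    # ord=2 returns the largest singular value, i.e. the spectral norm.
    return np.linalg.norm(M_z, 2)
```

When the source and target samples coincide and z is uniform, the objective is zero. A mismatch between the two samples shows up as a positive value that a better choice of z can reduce.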

12 Discrepancy Min. with Kernels
For L2 loss and $H = \{h \in \mathbb{H} : \|h\|_K \le \Lambda\}$, proof that it can be cast as a similar SDP:
minimize $\|M'(z)\|_2$
subject to
$M'(z) = M'_0 - \sum_{i=1}^{m} z_i M'_i$
$M'_0 = K^{1/2} D_0 K^{1/2}$
$M'_i = K^{1/2} D_i K^{1/2}$
$\|z\|_1 = 1, \quad z \ge 0$.
But it cannot be solved in practice even for a few hundred points, even with the best public SDP solvers.

13 Smooth Approximation
Convex optimization problem: minimize $F(z)$ over $z \in C$, with $C$ closed convex.
Smooth: $F$ with Lipschitz continuous gradient. Algorithm: $O(1/\sqrt{\epsilon})$, optimal for problem class.
Non-smooth: $F$ Lipschitz continuous. Find $G$, a uniform $\epsilon$-approximation of $F$. Algorithm: $O(1/\epsilon)$.
(Nesterov, 1983, 2005)

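14 Disc. Min. SDP Problem
Smooth approximation:
$F : z \mapsto \|M(z)\|_2$ is not differentiable.
$G_p : z \mapsto \frac{1}{2} \mathrm{Tr}[M(z)^{2p}]^{1/p}$ is a smooth uniform approximation.
Algorithm: $J = (\langle M_i, M_j \rangle_F)_{1 \le i, j \le m}$.
Algorithm 2
  $u_0 \leftarrow \operatorname{argmin}_{u \in C} u^\top J u$
  for $k \ge 0$ do
    $v_k \leftarrow \operatorname{argmin}_{u \in C} \frac{2p-1}{2} (u - u_k)^\top J (u - u_k) + \nabla G_p(M(u_k))^\top u$
    $w_k \leftarrow \operatorname{argmin}_{u \in C} \frac{2p-1}{2} (u - u_0)^\top J (u - u_0) + \sum_{i=0}^{k} \frac{i+1}{2} \nabla G_p(M(u_i))^\top u$
    $u_{k+1} \leftarrow \frac{2}{k+3} w_k + \frac{k+1}{k+3} v_k$
  end for
Fig. 2. Smooth approximation algorithm.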
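The quality of the smooth surrogate can be checked numerically on a fixed symmetric matrix. Since the trace of M^{2p} is the sum of the 2p-th powers of the eigenvalues, the surrogate lies between ½‖M‖₂² and ½ r^{1/p} ‖M‖₂², where r is the rank of M, and minimizing ½‖M‖₂² is equivalent to minimizing ‖M‖₂. The sketch below evaluates the surrogate through an eigenvalue decomposition; the function name `G_p` taking a matrix rather than z, and the random test matrix, are choices made for this example only.

```python
import numpy as np


def G_p(M, p):
    """Smooth surrogate 0.5 * Tr[M^(2p)]^(1/p) for a symmetric matrix M."""
    eig = np.linalg.eigvalsh(M)
    # Tr[M^(2p)] is the sum of the eigenvalues raised to the power 2p.
    return 0.5 * np.sum(np.abs(eig) ** (2 * p)) ** (1.0 / p)


# Compare with half the squared spectral norm on a random symmetric matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
M = (A + A.T) / 2.0
half_sq_norm = 0.5 * np.linalg.norm(M, 2) ** 2
for p in (1, 3, 10):
    value = G_p(M, p)
    # Sandwich: half_sq_norm <= G_p(M) <= r^(1/p) * half_sq_norm, with r <= 5 here.
    assert half_sq_norm * (1 - 1e-9) <= value <= 5 ** (1.0 / p) * half_sq_norm * (1 + 1e-9)
```

The upper factor r^{1/p} tends to 1 as p grows, so larger p gives a tighter approximation at the cost of a less well-conditioned smooth problem.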

15 Convergence Guarantee
Let $r = \max_{z \in C} \mathrm{rank}(M(z)) \le \max\{N, \sum_{i=0}^{n} \mathrm{rank}(M_i)\}$.
Theorem: for any $\epsilon > 0$, the algorithm solves the discrepancy minimization SDP with relative accuracy $\epsilon$ in $O(\sqrt{r \log r}/\epsilon)$ iterations.

16 Experiments - Time
(Figure: running time in seconds versus set size. One plot breaks the time down into QP optimization and grad G computation; the other compares Algorithm 2 with SeDuMi.)

17 Experiments - Performance
(Figure: four panels, Adaptation to Books, Adaptation to Dvd, Adaptation to Elec, Adaptation to Kitchen. Each plots RMSE versus unlabeled set size, with curves for books, dvd, elec, kitchen.)
Fig. 3. Multi-domain sentiment analysis data set (Blitzer et al. 2007): books, dvd, elec, kitchen. Treated as regression task.

18 Conclusion
Theoretical results for DA in regression.
New pointwise loss guarantees for a general class of loss functions.
Disc. min. adaptation extended to kernels.
Efficient algorithm for solving discrepancy minimization.
Shown to scale to relatively large data sets.
Empirically shown to be effective.
Still many adaptation questions left to address!

Learning with Imperfect Data

Mehryar Mohri, Courant Institute and Google, mohri@cims.nyu.edu
Joint work with: Yishay Mansour (Tel-Aviv & Google) and Afshin Rostamizadeh (Courant Institute).
Standard Learning Assumptions: IID assumption.