Learning Bounds for Importance Weighting
Tamas Madarasz & Michael Rabadi
April 15, 2015
Introduction
- Often, the training distribution does not match the testing distribution
- We want to utilize information about the test distribution
- Goal: correct the bias or discrepancy between the training and testing distributions
Importance Weighting
- Labeled training data from a source distribution Q
- Unlabeled test data from a target distribution P
- Weight the cost of errors on training instances
- Common definition of the weight for a point x: w(x) = P(x)/Q(x)
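The weighted training objective can be sketched in a few lines. Below is a minimal illustration assuming the densities of P and Q are known in closed form (in practice w(x) usually has to be estimated); the Gaussian choice of P, Q and the fixed hypothesis h are hypothetical:

```python
import numpy as np

def importance_weights(p_pdf, q_pdf, xs):
    """w(x) = P(x)/Q(x) evaluated at the sample points xs."""
    return p_pdf(xs) / q_pdf(xs)

def weighted_empirical_risk(losses, weights):
    """R_hat_w(h) = (1/m) * sum_i w(x_i) * L_h(x_i)."""
    return np.mean(weights * losses)

# Toy example: source Q = N(0, 1), target P = N(0.5, 1).
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=100_000)              # training sample drawn from Q
p = lambda x: np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
w = importance_weights(p, q, xs)
losses = (xs > 0).astype(float)                      # loss of some fixed hypothesis h
print(weighted_empirical_risk(losses, w))            # close to Pr_{x~P}[x > 0] ~ 0.69
```

The unweighted mean of `losses` would estimate the risk under Q (about 0.5 here); the weights shift it toward the risk under P.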
Importance Weighting
- A reasonable method, but it sometimes doesn't work
- Can we give generalization bounds for this method?
- When does domain adaptation (DA) work? When does it not?
- How should we weight the costs?
Overview
- Preliminaries
- Learning guarantee when the loss is bounded
- Learning guarantee when the loss is unbounded but its second moment is bounded
- Algorithm
Preliminaries: Rényi divergence
For α ≥ 0, the Rényi divergence D_α(P‖Q) between distributions P and Q is

D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log_2 \sum_x P(x) \left[ \frac{P(x)}{Q(x)} \right]^{\alpha - 1}

d_\alpha(P \| Q) = 2^{D_\alpha(P \| Q)} = \left[ \sum_x \frac{P^\alpha(x)}{Q^{\alpha - 1}(x)} \right]^{\frac{1}{\alpha - 1}}

- Measures the information lost when Q is used to approximate P
- D_α(P‖Q) = 0 iff P = Q
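For discrete distributions the exponentiated divergence d_α is a one-liner. A small sketch (the example distributions are made up):

```python
import numpy as np

def d_alpha(P, Q, alpha):
    """d_alpha(P||Q) = (sum_x P(x)^alpha / Q(x)^(alpha-1))^(1/(alpha-1)),
    the exponentiated Renyi divergence, for discrete distributions on a
    common finite support."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(P**alpha / Q**(alpha - 1)) ** (1.0 / (alpha - 1))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
print(d_alpha(P, Q, 2.0))   # = sum_x P(x)^2 / Q(x) = 0.625 + 0.225 + 0.2 = 1.05
print(d_alpha(P, P, 2.0))   # = 1, since D_alpha(P||P) = 0 and d_alpha = 2^0
```

Note that α = 2 gives d_2(P‖Q) = Σ_x P²(x)/Q(x), the quantity that controls the variance of the importance weights below.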
Preliminaries: Importance weights
Lemma 1:

E_Q[w] = 1, \quad E_Q[w^2] = d_2(P \| Q), \quad \sigma^2(w) = d_2(P \| Q) - 1

Proof:

E_Q[w^2] = \sum_{x \in X} w^2(x) Q(x) = \sum_{x \in X} \left[ \frac{P(x)}{Q(x)} \right]^2 Q(x) = d_2(P \| Q)

Lemma 2: For all α > 0 and x ∈ X,

E_Q[w^2(x) L_h^2(x)] \le d_{\alpha+1}(P \| Q) \, R(h)^{1 - \frac{1}{\alpha}}
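Lemma 1 can be checked exactly on a small discrete pair (the same toy P, Q as before; the distributions are illustrative only):

```python
import numpy as np

# Exact check of Lemma 1 on a finite support:
# E_Q[w] = 1, E_Q[w^2] = d_2(P||Q), sigma^2(w) = d_2(P||Q) - 1.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
w = P / Q                            # importance weights on the support
Ew  = np.sum(Q * w)                  # = sum_x P(x) = 1
Ew2 = np.sum(Q * w**2)               # = sum_x P(x)^2 / Q(x)
d2  = np.sum(P**2 / Q)               # = d_2(P||Q)
print(Ew, Ew2, Ew2 - Ew**2)          # 1.0, d_2, d_2 - 1
```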
Preliminaries: Importance weights
Hölder's inequality (see, e.g., Jin, Wilson, and Nobel, 2014): let 1/p + 1/q = 1. Then

\sum_x |a_x b_x| \le \left( \sum_x |a_x|^p \right)^{1/p} \left( \sum_x |b_x|^q \right)^{1/q}
Preliminaries: Importance weights
Proof of Lemma 2: Let the loss be bounded by B = 1. Then

E_{x \sim Q}[w^2(x) L_h^2(x)] = \sum_x \left[ \frac{P(x)}{Q(x)} \right]^2 L_h^2(x) \, Q(x)
= \sum_x P(x)^{\frac{1}{\alpha}} \frac{P(x)}{Q(x)} \cdot P(x)^{\frac{\alpha - 1}{\alpha}} L_h^2(x)

By Hölder's inequality with exponents α and α/(α−1),

\le \left[ \sum_x P(x) \left[ \frac{P(x)}{Q(x)} \right]^\alpha \right]^{\frac{1}{\alpha}} \left[ \sum_x P(x) L_h^{\frac{2\alpha}{\alpha - 1}}(x) \right]^{\frac{\alpha - 1}{\alpha}}
= d_{\alpha+1}(P \| Q) \left[ \sum_x P(x) L_h(x) L_h^{\frac{\alpha + 1}{\alpha - 1}}(x) \right]^{\frac{\alpha - 1}{\alpha}}
\le d_{\alpha+1}(P \| Q) \, R(h)^{1 - \frac{1}{\alpha}} B^{1 + \frac{1}{\alpha}} = d_{\alpha+1}(P \| Q) \, R(h)^{1 - \frac{1}{\alpha}}
Learning Guarantees: Bounded case
Let sup_x w(x) = sup_x P(x)/Q(x) = d_∞(P‖Q) = M, and assume d_∞(P‖Q) < +∞. Fix h ∈ H. Then, for any δ > 0, with probability at least 1 − δ,

R(h) \le \hat{R}_w(h) + M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}

M can be very large, so we naturally want a more favorable bound...
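A quick numeric look at M = d_∞(P‖Q) and the resulting O(M/√m) slack, again on the toy discrete pair (the sample size and δ are arbitrary illustrative choices):

```python
import numpy as np

# d_inf(P||Q) = sup_x P(x)/Q(x) = M, and the Hoeffding-style slack
# M * sqrt(log(2/delta) / (2m)) from the bound above.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
M = np.max(P / Q)                     # = 1.25 for this pair
m, delta = 10_000, 0.05
slack = M * np.sqrt(np.log(2 / delta) / (2 * m))
print(M, slack)                       # slack scales linearly with M
```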
Learning Guarantees: Bounded case
Theorem 1: Fix h ∈ H. For any α ≥ 1 and any δ > 0, with probability at least 1 − δ, the following bound holds:

R(h) \le \hat{R}_w(h) + \frac{2M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{2 \left[ d_{\alpha+1}(P \| Q) R(h)^{1 - \frac{1}{\alpha}} - R(h)^2 \right] \log \frac{1}{\delta}}{m}}
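To see why Theorem 1 is more favorable, compare the two slacks numerically. The values of M, d_2(P‖Q), and R(h) below are hypothetical, chosen to mimic a case where the weights are occasionally very large but have moderate variance:

```python
import numpy as np

# Hypothetical values for illustration only.
m, delta = 10_000, 0.05
M, d2, R = 100.0, 4.0, 0.2            # M = d_inf, d2 = d_2(P||Q), R = R(h)

# Hoeffding-style slack: dominated by M.
hoeffding = M * np.sqrt(np.log(2 / delta) / (2 * m))

# Theorem 1 slack with alpha = 1: M only enters the O(1/m) term,
# and the sqrt term uses the variance proxy d_2 * R^(1-1/alpha) - R^2.
variance = d2 * R**(1 - 1 / 1.0) - R**2
bernstein = 2 * M * np.log(1 / delta) / (3 * m) \
          + np.sqrt(2 * variance * np.log(1 / delta) / m)
print(hoeffding, bernstein)           # the Theorem 1 slack is far smaller here
```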
Learning Guarantees: Bounded case
Bernstein's inequality (Bernstein, 1946): for independent zero-mean random variables x_i with |x_i| ≤ M and variance σ²,

\Pr\left[ \frac{1}{n} \sum_{i=1}^n x_i \ge \epsilon \right] \le \exp\left( \frac{-n\epsilon^2}{2\sigma^2 + 2M\epsilon/3} \right)
Learning Guarantees: Bounded case
Proof of Theorem 1: Let Z be the random variable w(x)L_h(x) − R(h). Then |Z| ≤ M, and E_Q[Z] = 0 since E_Q[w L_h] = E_P[L_h] = R(h). By Lemma 2, the variance of Z can be bounded in terms of d_{α+1}(P‖Q):

\sigma^2(Z) = E_Q[w^2(x) L_h^2(x)] - R(h)^2 \le d_{\alpha+1}(P \| Q) R(h)^{1 - \frac{1}{\alpha}} - R(h)^2

Then, by Bernstein's inequality,

\Pr[R(h) - \hat{R}_w(h) > \epsilon] \le \exp\left( \frac{-m\epsilon^2/2}{\sigma^2(Z) + \epsilon M/3} \right)
Learning Guarantees: Bounded case
Setting δ equal to this upper bound and solving for ε, with probability at least 1 − δ,

R(h) \le \hat{R}_w(h) + \frac{M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{M^2 \log^2 \frac{1}{\delta}}{9m^2} + \frac{2\sigma^2(Z) \log \frac{1}{\delta}}{m}}
\le \hat{R}_w(h) + \frac{2M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{2\sigma^2(Z) \log \frac{1}{\delta}}{m}}

using \sqrt{a + b} \le \sqrt{a} + \sqrt{b}.
Learning Guarantees: Bounded case
Theorem 2: Let H be a finite hypothesis set. Then for any δ > 0, with probability at least 1 − δ, the following holds for the importance weighting method:

R(h) \le \hat{R}_w(h) + \frac{2M \left( \log |H| + \log \frac{1}{\delta} \right)}{3m} + \sqrt{\frac{2 d_2(P \| Q) \left( \log |H| + \log \frac{1}{\delta} \right)}{m}}
Learning Guarantees: Bounded case
Theorem 2 corresponds to the case α = 1. Note that Theorem 1 simplifies when α = 1, since d_2(P‖Q) R(h)^0 − R(h)² ≤ d_2(P‖Q):

R(h) \le \hat{R}_w(h) + \frac{2M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{2 d_2(P \| Q) \log \frac{1}{\delta}}{m}}

Theorem 2 then follows by a union bound over the |H| hypotheses.
Learning Guarantees: Bounded case
Proposition 2 (lower bound): Assume M < ∞ and σ²(w)/M² ≥ 1/m, and assume there exists h₀ ∈ H such that L_{h₀}(x) = 1 for all x. Then there exists an absolute constant c, c = 2/41², such that

\Pr\left[ \sup_{h \in H} \left| R(h) - \hat{R}_w(h) \right| \ge c \sqrt{\frac{d_2(P \| Q) - 1}{4m}} \right] > 0

(Note that d_2(P‖Q) − 1 = σ²(w) by Lemma 1.) Proof: see Theorem 9 of Cortes, Mansour, and Mohri, 2010.
Learning Guarantees: Unbounded case
d_∞(P‖Q) < ∞ does not always hold... Assume P and Q are Gaussian with standard deviations σ_P and σ_Q and means μ and μ′:

\frac{P(x)}{Q(x)} = \frac{\sigma_Q}{\sigma_P} \exp\left[ -\frac{\sigma_Q^2 (x - \mu)^2 - \sigma_P^2 (x - \mu')^2}{2 \sigma_P^2 \sigma_Q^2} \right]

Thus, even if σ_P = σ_Q, whenever μ ≠ μ′ we have d_∞(P‖Q) = sup_x P(x)/Q(x) = ∞, and Theorem 1 is not informative.
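This unboundedness is easy to see numerically. With equal unit variances and means differing by 1 (illustrative values), the ratio is exp(x − 1/2), which grows without bound:

```python
import numpy as np

# P = N(mu_P, 1), Q = N(mu_Q, 1): equal variances, different means.
mu_P, mu_Q, sigma = 1.0, 0.0, 1.0
ratio = lambda x: np.exp(((x - mu_Q)**2 - (x - mu_P)**2) / (2 * sigma**2))

for x in [0.0, 5.0, 10.0, 20.0]:
    print(x, ratio(x))   # here ratio(x) = exp(x - 1/2): unbounded, so d_inf = inf
```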
Learning Guarantees: Unbounded case
However, the second moment of the importance weights can still be bounded:

d_2(P \| Q) = E_Q[w^2] = \frac{\sigma_Q}{\sigma_P^2 \sqrt{2\pi}} \int \exp\left[ -\frac{2\sigma_Q^2 (x - \mu)^2 - \sigma_P^2 (x - \mu')^2}{2 \sigma_P^2 \sigma_Q^2} \right] dx

which is finite whenever σ_P² < 2σ_Q².
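We can evaluate d_2(P‖Q) = ∫ P²(x)/Q(x) dx numerically. For equal variances σ the integral works out to exp((μ − μ′)²/σ²) by completing the square; the sketch below checks this with a simple Riemann sum (grid limits and resolution are arbitrary choices):

```python
import numpy as np

def gauss(x, mu, s):
    """Density of N(mu, s^2)."""
    return np.exp(-(x - mu)**2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

# d_2(P||Q) = E_Q[w^2] = integral of P(x)^2 / Q(x) dx, via a Riemann sum.
mu_P, mu_Q, s = 1.0, 0.0, 1.0
xs = np.linspace(-30.0, 30.0, 120_001)
dx = xs[1] - xs[0]
d2 = np.sum(gauss(xs, mu_P, s)**2 / gauss(xs, mu_Q, s)) * dx
print(d2)   # ~ exp((mu_P - mu_Q)^2 / s^2) = e for these equal-variance Gaussians
```

So even though sup_x w(x) = ∞ here, E_Q[w²] ≈ 2.718 is perfectly moderate.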
Learning Guarantees: Unbounded case
Intuition: if μ = μ′ and σ_P ≫ σ_Q:
- Q provides some useful information about P
- But a sample from Q has only a few points far from μ
- Those few extreme sample points receive very large weights
Likewise, if σ_P = σ_Q but μ′ ≫ μ, most weights would be negligible.
Learning Guarantees: Unbounded case
Theorem 3: Let H be a hypothesis set such that Pdim({L_h(x) : h ∈ H}) = p < ∞. Assume that d_2(P‖Q) < +∞ and w(x) ≠ 0 for all x. Then for any δ > 0, with probability at least 1 − δ, the following holds:

R(h) \le \hat{R}_w(h) + 2^{5/4} \sqrt{d_2(P \| Q)} \left[ \frac{p \log \frac{2me}{p} + \log \frac{4}{\delta}}{m} \right]^{3/8}
Learning Guarantees: Unbounded case
Proof outline (full proof in Cortes, Mansour, and Mohri, 2010):

\Pr\left[ \sup_{h \in H} \frac{E[L_h] - \hat{E}[L_h]}{\sqrt{E[L_h^2]}} > \epsilon \left( 2 + \sqrt{\log \tfrac{1}{\epsilon}} \right) \right]
\le \Pr\left[ \sup_{h \in H,\, t \in \mathbb{R}} \frac{\Pr[L_h > t] - \hat{\Pr}[L_h > t]}{\sqrt{\Pr[L_h > t]}} > \epsilon \right]
\le 4 \, \Pi_H(2m) \exp\left( -\frac{m\epsilon^2}{4} \right)
\le 4 \exp\left( p \log \frac{2em}{p} - \frac{m\epsilon^2}{4} \right)

Bounding \epsilon (2 + \sqrt{\log \tfrac{1}{\epsilon}}) by 4^{1/4} \epsilon^{3/4} then gives

\Pr\left[ \sup_{h \in H} \frac{E[L_h] - \hat{E}[L_h]}{\sqrt{E[L_h^2]}} > \epsilon \right] \le 4 \exp\left( p \log \frac{2em}{p} - \frac{m\epsilon^{8/3}}{4^{5/3}} \right)

Solving for ε (and applying the symmetric bound for \hat{E}):

\left| E[L_h] - \hat{E}[L_h] \right| \le 2^{5/4} \max\left\{ \sqrt{E[L_h^2]}, \sqrt{\hat{E}[L_h^2]} \right\} \left[ \frac{p \log \frac{2me}{p} + \log \frac{8}{\delta}}{m} \right]^{3/8}
Learning Guarantees: Unbounded case
Thus, we can show the following:

\Pr\left[ \sup_{h \in H} \frac{R(h) - \hat{R}_w(h)}{\sqrt{d_2(P \| Q)}} > \epsilon \right] \le 4 \exp\left( p \log \frac{2em}{p} - \frac{m\epsilon^{8/3}}{4^{5/3}} \right)

where p = Pdim({L_h(x) : h ∈ H}) also bounds the pseudo-dimension of H′ = {w(x)L_h(x) : h ∈ H}. Indeed, any set A shattered by H′ is shattered by {L_h : h ∈ H}: if H′ shatters A with witnesses r_i, then {L_h : h ∈ H} shatters A with witnesses s_i = r_i / w(x_i) (using w(x) ≠ 0).
Alternative algorithms
We can generalize this analysis to an arbitrary function u : X → R, u > 0. Let \hat{R}_u(h) = \frac{1}{m} \sum_{i=1}^m u(x_i) L_h(x_i) and let \hat{Q} be the empirical distribution.

Theorem 4: Let H be a hypothesis set such that Pdim({L_h(x) : h ∈ H}) = p < ∞. Assume that 0 < E_Q[u²(x)] < +∞ and u(x) ≠ 0 for all x. Then for any δ > 0, with probability at least 1 − δ,

R(h) \le \hat{R}_u(h) + E_Q\left[ (w(x) - u(x)) L_h(x) \right]
+ 2^{5/4} \max\left\{ \sqrt{E_Q[u^2(x) L_h^2(x)]}, \sqrt{E_{\hat{Q}}[u^2(x) L_h^2(x)]} \right\} \left[ \frac{p \log \frac{2me}{p} + \log \frac{4}{\delta}}{m} \right]^{3/8}
Alternative algorithms
- Functions u other than w can be used to reweight the cost of errors
- Minimize the upper bound of Theorem 4; since

\max\left\{ \sqrt{E_Q[u^2]}, \sqrt{E_{\hat{Q}}[u^2]} \right\} \le \sqrt{E_Q[u^2]} \left( 1 + O(1/\sqrt{m}) \right),

this suggests solving

\min_{u \in U} \; E_Q\left[ |w(x) - u(x)| \right] + \gamma \sqrt{E_Q[u^2(x)]}

- Trade-off between bias (first term) and variance (second term) minimization.
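One simple instance of this trade-off, sketched below under assumptions: take U to be the family of clipped weights u(x) = min(w(x), c) and grid-search the threshold c against a sample estimate of the objective. The clipping family, the Gaussian pair, and the value of γ are illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

# Bias/variance trade-off sketch: u(x) = min(w(x), c), choosing the clipping
# threshold c that minimizes an estimate of E_Q[|w - u|] + gamma*sqrt(E_Q[u^2]).
rng = np.random.default_rng(1)
xs = rng.normal(0.0, 1.0, size=50_000)      # sample from Q = N(0, 1)
w = np.exp(xs - 0.5)                        # w = P/Q for the target P = N(1, 1)
gamma = 0.1

def objective(c):
    u = np.minimum(w, c)
    bias = np.mean(w - u)                   # estimate of E_Q[|w - u|] (w >= u here)
    spread = np.sqrt(np.mean(u**2))         # estimate of sqrt(E_Q[u^2])
    return bias + gamma * spread

cs = np.linspace(0.5, 20.0, 200)
best_c = cs[int(np.argmin([objective(c) for c in cs]))]
print(best_c, objective(best_c))
```

Small c means heavy clipping (low variance, high bias); large c recovers the plain importance weights (no bias, full variance). The grid search picks the compromise dictated by γ.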