Learning Bounds for Importance Weighting


1 Learning Bounds for Importance Weighting

Corinna Cortes (Google Research), Yishay Mansour (Tel-Aviv University), Mehryar Mohri (Courant Institute & Google Research)

2 Motivation

Importance weighting is used in a variety of contexts:
- sample bias correction (e.g., Zadrozny et al., 2003; Dudík et al., 2006; Huang et al., 2006; Sugiyama et al., 2008).
- domain adaptation.
- active learning (Beygelzimer et al., 2009).
- analysis of boosting (Dasgupta and Long, 2003).

Guarantees?
- when is importance weighting successful?
- are there better reweighting techniques than the straightforward standard approach?

3 Setting

- Input space $X$, output space $Y$.
- Loss function $L\colon Y \times Y \to [0, 1]$.
- Source distribution $Q$, target distribution $P$.
- Training sample $S$ of size $m$ drawn according to $Q$.
- Hypothesis set $H$. Fixed target labeling function $f\colon X \to Y$.
- Notation: for any $x \in X$ and $h \in H$, $L_h(x) = L(h(x), f(x))$.

4 Importance Weighting

Emphasize the loss of training point $x$ by $w(x)$:
- empirical loss: $\widehat{R}_w(h) = \frac{1}{m} \sum_{i=1}^{m} w(x_i)\, L_h(x_i)$.
- generalization loss: $R(h) = \operatorname{E}_P[L_h(x)]$.

Weight assumed known: $w(x) = P(x)/Q(x)$. In practice, an estimated weight $\widehat{w}(x)$ is used; the effect of this estimation error is specifically analyzed by (Cortes et al., 2008).

Different scenarios: importance weighting vs. importance sampling.
- imp. weighting: finite sample of size $m$ from some fixed $Q$.
- imp. sampling: unlimited sampling, can choose $Q$.
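As a quick illustration (my sketch, not the authors' code), here is the importance-weighted empirical loss in a toy setting where both densities $P$ and $Q$ are known; the Gaussian choices and the loss $L_h$ are hypothetical:

```python
import numpy as np

def weighted_empirical_loss(losses, weights):
    # (1/m) * sum_i w(x_i) L_h(x_i): the importance-weighted empirical loss
    return np.mean(weights * losses)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)               # training sample from Q = N(0, 1)
q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # source density Q at the sample
p = np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)  # target density P = N(0.5, 1)
w = p / q                                           # importance weights w(x) = P(x)/Q(x)
losses = (x > 0).astype(float)                      # a toy loss L_h(x) in [0, 1]
print(weighted_empirical_loss(losses, w))           # approximates R(h) = E_P[L_h] ~ 0.69
```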

5 Does Importance Weighting Work?

[Figure: error vs. training set size under source $Q$ and target $P$, two panels: Ratio = 0.3 and Ratio = 0.75.]

Hypothesis class: hyperplanes tangent to the unit circle. Best hypothesis chosen by empirical risk minimization.

6 Outline

- Preliminaries.
- Learning bounds for bounded importance weights.
- Learning bounds for unbounded importance weights (the most common case).
- Alternative reweighting techniques.

7 Rényi Divergences

Definition (Rényi, 1960): for $\alpha \geq 0$,
$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log_2 \sum_x P(x) \left[\frac{P(x)}{Q(x)}\right]^{\alpha - 1}.$$
- $D_1$ coincides with the relative entropy (KL divergence).
- $D_\alpha$ is a non-decreasing function of $\alpha$.

Notation: for all $\alpha \geq 0$,
$$d_\alpha(P \,\|\, Q) = 2^{D_\alpha(P \,\|\, Q)} = \left[\sum_x \frac{P^\alpha(x)}{Q^{\alpha - 1}(x)}\right]^{\frac{1}{\alpha - 1}}.$$
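A minimal sketch (my addition) of $d_\alpha$ for finite distributions, computed directly from the definition above; valid for $\alpha \geq 0$, $\alpha \neq 1$:

```python
import numpy as np

def d_alpha(P, Q, alpha):
    # d_alpha(P||Q) = [sum_x P(x)^alpha / Q(x)^(alpha-1)]^(1/(alpha-1)), alpha != 1
    s = np.sum(P**alpha / Q**(alpha - 1))
    return s ** (1.0 / (alpha - 1))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
for a in [0.5, 2.0, 10.0, 100.0]:
    print(a, d_alpha(P, Q, a))  # non-decreasing in alpha; tends to max_x P/Q = 1.25
```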

8 Properties

Properties of $w$: $\operatorname{E}_Q[w] = 1$, $\operatorname{E}_Q[w^2] = d_2(P \,\|\, Q)$, $\sigma^2(w) = d_2(P \,\|\, Q) - 1$.

Properties of $w L_h$: the weighted estimator is unbiased,
$$\operatorname{E}_Q[\widehat{R}_w(h)] = \operatorname{E}_Q[w(x) L_h(x)] = \sum_x \frac{P(x)}{Q(x)}\, L_h(x)\, Q(x) = \sum_x P(x) L_h(x) = R(h),$$
and its second moment is controlled by the Rényi divergence: for $\alpha > 1$,
$$\operatorname{E}_Q[w^2(x) L_h^2(x)] = \sum_x \frac{P^2(x)}{Q(x)}\, L_h^2(x) = \sum_x \frac{P^{\frac{\alpha+1}{\alpha}}(x)}{Q(x)}\, P^{\frac{\alpha-1}{\alpha}}(x)\, L_h^2(x)$$
$$\leq \left[\sum_x \frac{P^{\alpha+1}(x)}{Q^{\alpha}(x)}\right]^{\frac{1}{\alpha}} \left[\sum_x P(x)\, L_h^{\frac{2\alpha}{\alpha-1}}(x)\right]^{\frac{\alpha-1}{\alpha}} \qquad \text{(Hölder's inequality)}$$
$$= d_{\alpha+1}(P \,\|\, Q) \left[\sum_x P(x)\, L_h(x)\, L_h^{\frac{\alpha+1}{\alpha-1}}(x)\right]^{\frac{\alpha-1}{\alpha}} \leq d_{\alpha+1}(P \,\|\, Q)\, R(h)^{1 - \frac{1}{\alpha}}.$$
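A quick numerical check of the first three identities (a minimal sketch; the two distributions are hypothetical):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])       # target distribution
Q = np.array([0.4, 0.4, 0.2])       # source distribution
w = P / Q                           # importance weights

print(np.sum(Q * w))                # E_Q[w] = 1
print(np.sum(Q * w**2))             # E_Q[w^2] = d_2(P||Q)
print(np.sum(P**2 / Q))             # d_2(P||Q) computed from its definition
print(np.sum(Q * w**2) - 1)         # sigma^2(w) = d_2(P||Q) - 1
```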

9 Learning Bound - Bounded Case

Assumption (bounded case): $M = \sup_x w(x) = \sup_x \frac{P(x)}{Q(x)} = d_\infty(P \,\|\, Q) < \infty$.

Theorem: let $H$ be a finite hypothesis set. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H$,
$$R(h) \leq \widehat{R}_w(h) + \frac{2M \left(\log|H| + \log\frac{1}{\delta}\right)}{3m} + \sqrt{\frac{2\, d_2(P \,\|\, Q) \left(\log|H| + \log\frac{1}{\delta}\right)}{m}}.$$
- similar result for infinite hypothesis sets.
- note the role of the Rényi divergence.
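To see how the two terms behave, here is a sketch (my addition) that evaluates the right-hand side of the theorem for hypothetical values of $M$, $d_2(P \,\|\, Q)$, $|H|$, $m$, and $\delta$:

```python
import numpy as np

def bounded_case_bound(emp_loss, M, d2, H_size, m, delta):
    # R(h) <= R_w(h) + 2M(log|H| + log(1/delta))/(3m)
    #               + sqrt(2 d_2(P||Q)(log|H| + log(1/delta))/m)
    log_term = np.log(H_size) + np.log(1.0 / delta)
    return emp_loss + 2 * M * log_term / (3 * m) + np.sqrt(2 * d2 * log_term / m)

# the sqrt (variance) term dominates for moderate m; the M term fades as O(1/m)
print(bounded_case_bound(emp_loss=0.10, M=5.0, d2=1.05, H_size=100, m=10_000, delta=0.05))
```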

10 Lower Bound - Bounded Case

Theorem: assume that $M < \infty$ and $\sigma^2(w)/M^2 \geq 1/m$. Assume that $H$ contains a hypothesis $h_0$ such that $L_{h_0}(x) = 1$ for all $x$. Then, there exists an absolute constant $c = 2/41^2$ such that
$$\Pr\left[\sup_{h \in H} \left|R(h) - \widehat{R}_w(h)\right| \geq \sqrt{\frac{d_2(P \,\|\, Q) - 1}{4m}}\right] \geq c > 0.$$
- result based on the proof of a general lower-bound theorem for maximal variance.

11 Unbounded Case

The assumption $d_\infty(P \,\|\, Q) < \infty$ does not hold, even in some natural cases.

Example: Gaussian distributions
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma_P} \exp\left(-\frac{(x - \mu)^2}{2\sigma_P^2}\right), \qquad Q(x) = \frac{1}{\sqrt{2\pi}\,\sigma_Q} \exp\left(-\frac{(x - \mu')^2}{2\sigma_Q^2}\right).$$
- even for $\sigma_P = \sigma_Q$ and $\mu \neq \mu'$, $d_\infty(P \,\|\, Q) = \infty$.
- but for $\sigma_Q > \frac{\sqrt{2}}{2}\,\sigma_P$ (e.g., the example shown earlier), $d_2(P \,\|\, Q) < \infty$, thus the second moment of $w$ is bounded.
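A numerical sanity check of the Gaussian example (a sketch, assuming scipy is available); $d_2(P \,\|\, Q) = \int P^2(x)/Q(x)\,dx$, computed in log space for stability, with hypothetical parameter values:

```python
import numpy as np
from scipy.integrate import quad

def d2_gaussians(mu_p, s_p, mu_q, s_q, lim=20.0):
    # d_2(P||Q) = integral of P(x)^2 / Q(x) dx; finite only when s_q > s_p / sqrt(2)
    def integrand(x):
        log_p = -(x - mu_p)**2 / (2 * s_p**2) - np.log(np.sqrt(2 * np.pi) * s_p)
        log_q = -(x - mu_q)**2 / (2 * s_q**2) - np.log(np.sqrt(2 * np.pi) * s_q)
        return np.exp(2 * log_p - log_q)
    val, _ = quad(integrand, -lim, lim)
    return val

print(d2_gaussians(0.0, 1.0, 0.5, 1.0))  # sigma_Q = sigma_P: d_2 finite (d_inf is not)
print(d2_gaussians(0.0, 1.0, 0.0, 0.8))  # sigma_Q > sigma_P/sqrt(2): still finite
# for sigma_Q <= sigma_P/sqrt(2) the integrand grows like exp(+c x^2): d_2 = infinity
```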

12 Learning Bound - Unbounded Case

Theorem: let $H$ be such that $\operatorname{Pdim}(\{L_h(x)\colon h \in H\}) = p$ is finite. Assume that $d_2(P \,\|\, Q) < \infty$ and $w(x) \neq 0$ for all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$,
$$R(h) \leq \widehat{R}_w(h) + 2^{5/4} \sqrt{d_2(P \,\|\, Q)} \left[\frac{p \log\frac{2me}{p} + \log\frac{4}{\delta}}{m}\right]^{3/8}.$$
- holds even for unbounded weights.
- based on a new proof of a general learning-bound theorem for unbounded losses with bounded second moment (Vapnik's proof is incorrect!).

13 Proof Ideas

14 Alternative Weighting Techniques

Arbitrary $u\colon X \to \mathbb{R}$ with $u > 0$; for any $h \in H$,
$$\widehat{R}_u(h) = \frac{1}{m} \sum_{i=1}^{m} u(x_i)\, L_h(x_i).$$

Theorem: let $H$ be such that $\operatorname{Pdim}(\{L_h(x)\colon h \in H\}) = p$ is finite. Assume that $0 < \operatorname{E}_Q[u^2(x)] < \infty$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in H$,
$$R(h) \leq \widehat{R}_u(h) + \operatorname{E}_Q[(w(x) - u(x)) L_h(x)] + 2^{5/4} \max\left\{\sqrt{\operatorname{E}_Q[u^2(x) L_h^2(x)]},\ \sqrt{\widehat{\operatorname{E}}_Q[u^2(x) L_h^2(x)]}\right\} \left[\frac{p \log\frac{2me}{p} + \log\frac{4}{\delta}}{m}\right]^{3/8}.$$

15 Alternative Weighting Techniques

Trade-off between the bias term and the second-moment term:
$$\operatorname{E}_Q[(w(x) - u(x)) L_h(x)] \quad \text{vs.} \quad \max\left\{\operatorname{E}_Q[u^2(x) L_h^2(x)],\ \widehat{\operatorname{E}}_Q[u^2(x) L_h^2(x)]\right\}.$$

Using an upper bound independent of $h \in H$ leads to the optimization problem
$$\min_{u \in U}\ \operatorname{E}_Q\big[|w(x) - u(x)|\big] + \gamma\, \operatorname{E}_Q[u^2(x)],$$
with $\gamma > 0$ a trade-off parameter.
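If $U$ is left unconstrained, this objective separates across points, and the per-point minimizer of $|w(x) - u(x)| + \gamma\, u(x)^2$ is $u^*(x) = \min\{w(x), 1/(2\gamma)\}$, i.e. capping of large weights. This closed form is my own reading of the objective, not stated on the slide; a sketch:

```python
import numpy as np

def capped_weights(w, gamma):
    # pointwise minimizer of |w(x) - u(x)| + gamma * u(x)^2 over u >= 0:
    # below the cap 1/(2*gamma), keeping u = w incurs no bias and is optimal;
    # above it, the quadratic penalty wins and u is truncated at 1/(2*gamma)
    return np.minimum(w, 1.0 / (2.0 * gamma))
```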

16 Alternative Reweighting - Example

[Figure: four panels (1)-(4) comparing reweighting schemes.]

The variance is reduced in (3) by replacing $w$ with the average weight per quantile.
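A sketch of the quantile-averaging idea from panel (3), assuming only that weights within each empirical quantile bin are replaced by their bin average (the authors' exact construction is not in the transcription):

```python
import numpy as np

def quantile_averaged_weights(w, n_bins=10):
    # replace each weight by the mean of its quantile bin: the overall mean of
    # the weights is preserved exactly, while within-bin variance is removed
    order = np.argsort(w)
    u = np.empty_like(w, dtype=float)
    for idx in np.array_split(order, n_bins):
        u[idx] = w[idx].mean()
    return u

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.5, size=1000)  # heavy-tailed importance weights
u = quantile_averaged_weights(w, n_bins=20)
print(w.var(), u.var())                            # variance drops sharply
```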

17 Conclusion and Open Questions

- Learning guarantees for importance weighting, including the unbounded case (the most common one).
- Analysis of cases where importance weighting can succeed.
- Critical role of the Rényi divergence of the distributions.
- Preliminary exploration of other reweighting techniques.
- Open question: estimation of the Rényi divergence from finite samples.
