Learning Bounds for Importance Weighting


Learning Bounds for Importance Weighting

Corinna Cortes, Google Research, New York, NY 10011
Yishay Mansour, Tel-Aviv University, Tel-Aviv 69978, Israel
Mehryar Mohri, Courant Institute and Google, New York, NY 10012

Abstract

This paper presents an analysis of importance weighting for learning from finite samples and gives a series of theoretical and algorithmic results. We point out simple cases where importance weighting can fail, which suggests the need for an analysis of the properties of this technique. We then give both upper and lower bounds for generalization with bounded importance weights and, more significantly, give learning guarantees for the more common case of unbounded importance weights under the weak assumption that the second moment is bounded, a condition related to the Rényi divergence of the training and test distributions. These results are based on a series of novel and general bounds we derive for unbounded loss functions, which are of independent interest. We use these bounds to guide the definition of an alternative reweighting algorithm and report the results of experiments demonstrating its benefits. Finally, we analyze the properties of normalized importance weights, which are also commonly used.

1 Introduction

In real-world applications of machine learning, the sampling of the training and test instances often differs, which results in a mismatch between the two distributions. For example, in web search applications, there may be data regarding users who clicked on some advertisement link but little or no information about other users. Similarly, in credit default analyses, there is typically some information available about the credit defaults of customers who were granted credit, but no such information is at hand about rejected customers. In other problems such as adaptation, the training data available is drawn from a source domain different from the target domain. These issues of biased sampling or adaptation have long been recognized and studied in the statistics literature. There is also a large body of literature dealing with different techniques for sample bias correction [29, 16, 8, 25, 6] or domain adaptation [3, 7, 19, 10, 17] in the recent machine learning and natural language processing literature.

A common technique used in several of these publications for correcting the bias or discrepancy is based on the so-called importance weighting technique. This consists of weighting the cost of errors on training instances to emphasize the error on some or de-emphasize it on others, with the objective of correcting the mismatch between the distributions of training and test points, as in sample bias correction, adaptation, and other related contexts such as active learning [24, 14, 8, 19, 5]. Different definitions have been adopted for these weights. A common definition of the weight for a point x is w(x) = P(x)/Q(x), where P is the target or test distribution and Q is the distribution according to which training points are drawn. A favorable property of this definition, which is not hard to verify, is that it leads to unbiased estimates of the generalization error [8].

This paper presents an analysis of importance weighting for learning from finite samples. Our study was originally motivated by the observation that, while this corrective technique seems natural, in some cases in practice it does not succeed. An example in dimension two is illustrated by Figure 1. The target distribution P is the even mixture of two Gaussians centered at (0, 0) and (0, 2), both with standard deviation σ_P, while the source distribution Q is the even mixture of two Gaussians centered at (0, 0) and (2, 0), but with standard deviation σ_Q. The hypothesis class is that of hyperplanes tangent to the unit sphere, and the best classifier is selected by empirical risk minimization.

[Figure 1: Example of importance weighting. Left figure: P (in blue) and Q (in red) are even mixtures of Gaussians. The labels are positive within the unit sphere centered at the origin (in grey), negative elsewhere. The hypothesis class is that of hyperplanes tangent to the unit sphere. Right figures: plots of test error vs. training sample size using importance weighting for two different values of the ratio σ_Q/σ_P (0.3 and 0.75). The results indicate mean values of the error over 40 runs ± one standard deviation.]

As shown in Figure 1, for σ_Q/σ_P = 0.3, the error of the hypothesis learned using importance weighting is close to 50% even for a training sample of 5,000 points, and the standard deviation of the error is quite high. In contrast, for σ_Q/σ_P = 0.75, convergence occurs relatively rapidly and learning is successful. In Section 4, we discuss other examples where importance weighting does not succeed.

The problem just described is not limited to isolated examples. Similar observations have been made in the past in both the statistics and learning literature, more recently in the context of the analysis of boosting by [9], who suggest that importance weighting must be used with care and highlight the need for convergence bounds and learning guarantees for this technique.

We study the theoretical properties of importance weighting. We show using standard generalization bounds that importance weighting can succeed when the weights are bounded. However, this condition often does not hold in practice. We also show that, remarkably, convergence guarantees can be given even for unbounded weights under the weak assumption that the second moment of the weights is bounded, a condition that relates to the Rényi divergence of P and Q. We further extend these bounds to guarantees for other possible reweightings. These results suggest minimizing a bias-variance tradeoff that we discuss and that leads to several algorithmic ideas. We explore in detail an algorithm based on these ideas and report the results of experiments demonstrating its benefits.
Throughout this paper, we consider the case where the weight function w is known. When it is not, it is typically estimated from finite samples; the effect of this estimation error is specifically analyzed by [8]. This setting is closely related to the problem of importance sampling in statistics, which is that of estimating the expectation of a random variable according to P while using a sample drawn according to Q, with w given [18]. Here, we are concerned with the effect of the weights on learning from finite samples. A different setting is when full access to Q is further assumed: von Neumann's rejection sampling technique [28] can then be used. We note however that it requires w to be bounded by some constant M, which is often not guaranteed and is the simplest case of our bounds. Even then, the method is wasteful, as it requires on average M samples to obtain one point.
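For concreteness, here is a minimal sketch of the rejection sampling baseline just mentioned (this is not code from the paper; the specific Gaussian choices for P and Q are illustrative assumptions): a proposal x drawn from Q is accepted with probability w(x)/M, which requires an explicit bound M ≥ sup_x w(x) and consumes on average M proposals per accepted point, whereas importance weighting keeps every sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice (not from the paper): P = N(0, 1), Q = N(0, 2^2),
# so w(x) = P(x)/Q(x) <= 2 is bounded and rejection sampling applies.
def pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

M = 2.0  # M = sup_x P(x)/Q(x) = sigma_Q / sigma_P here, attained at x = 0

def rejection_sample(m):
    """Draw m points distributed according to P from proposals drawn according to Q."""
    accepted, proposals = [], 0
    while len(accepted) < m:
        x = rng.normal(0.0, 2.0)  # proposal drawn from Q
        proposals += 1
        # Accept with probability w(x) / M <= 1.
        if rng.uniform() < pdf(x, 0.0, 1.0) / (M * pdf(x, 0.0, 2.0)):
            accepted.append(x)
    return np.array(accepted), proposals

_, n_prop = rejection_sample(1000)
print("proposals per accepted point:", n_prop / 1000)  # about M = 2, as noted above
```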

The remainder of this paper is structured as follows. Section 2 introduces the definition of the Rényi divergences and gives some basic properties of the importance weights. In Section 3, we give generalization bounds for importance weighting in the bounded case. We also present a general lower bound indicating the key role played by the Rényi divergence of P and Q in this context. Section 4 deals with the more frequent case of unbounded w. Standard generalization bounds do not apply here since the loss function is unbounded. We give novel generalization bounds for unbounded loss functions under the assumption that the second moment is bounded (see Appendix) and use them to derive learning guarantees for importance weighting in this more general setting. In Section 5, we discuss an algorithm inspired by these guarantees for which we report preliminary experimental results. We also discuss why the commonly used remedy of truncating or capping importance weights may not always provide the desired effect of improved performance. Finally, in Section 6, we study the properties of an alternative reweighting, also commonly used, which is based on normalized importance weights, and discuss its relationship with the (unnormalized) weights w.

2 Preliminaries

Let X denote the input space, Y the label set, and let L: Y × Y → [0, 1] be a loss function. We denote by P the target distribution and by Q the source distribution according to which training points are drawn. We also denote by H the hypothesis set used by the learning algorithm and by f: X → Y the target labeling function.

2.1 Rényi divergences

Our analysis makes use of the notion of Rényi divergence, an information-theoretic measure of the difference between two distributions directly relevant to the study of importance weighting. For α ≥ 0, the Rényi divergence D_α(P‖Q) between distributions P and Q is defined by [23]

    D_α(P‖Q) = 1/(α−1) log_2 Σ_x P(x) [P(x)/Q(x)]^(α−1).    (1)

The Rényi divergence is a non-negative quantity and, for any α > 0, D_α(P‖Q) = 0 iff P = Q. For α = 1, it coincides with the relative entropy. We denote by d_α(P‖Q) the exponential in base 2 of the Rényi divergence D_α(P‖Q):

    d_α(P‖Q) = 2^{D_α(P‖Q)} = [Σ_x P^α(x)/Q^(α−1)(x)]^(1/(α−1)).    (2)

2.2 Importance weights

The importance weight for distributions P and Q is defined by w(x) = P(x)/Q(x). In the following, the expectations are taken with respect to Q.

Lemma 1. The following identities hold for the expectation, second moment, and variance of w:

    E[w] = 1,    E[w²] = d_2(P‖Q),    σ²(w) = d_2(P‖Q) − 1.    (3)

Proof. The first equality is immediate. The second moment of w can be expressed as follows in terms of the Rényi divergence:

    E[w²] = Σ_{x∈X} w²(x) Q(x) = Σ_{x∈X} [P(x)/Q(x)]² Q(x) = Σ_{x∈X} P(x) [P(x)/Q(x)] = d_2(P‖Q).

Thus, the variance of w is given by σ²(w) = E[w²] − E[w]² = d_2(P‖Q) − 1.

For any hypothesis h ∈ H, we denote by R(h) its loss and by R_w(h) its weighted empirical loss:

    R(h) = E_{x∼P}[L(h(x), f(x))],    R_w(h) = (1/m) Σ_{i=1}^{m} w(x_i) L(h(x_i), f(x_i)).

We shall use the abbreviated notation L_h(x) for L(h(x), f(x)), in the absence of any ambiguity about the target function f. Note that the unnormalized importance weighting of the loss is unbiased:

    E_{x∼Q}[w(x) L_h(x)] = Σ_x Q(x) [P(x)/Q(x)] L_h(x) = Σ_x P(x) L_h(x) = R(h).
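As a quick numerical sanity check of definition (2) and Lemma 1 (this is an illustration, not code from the paper), the sketch below computes d_α(P‖Q) for two small discrete distributions and verifies that E[w] = 1, E[w²] = d_2(P‖Q), and that the importance-weighted empirical loss is an unbiased estimate of R(h) for an arbitrary fixed loss vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two distributions on a small finite domain (illustrative values).
P = np.array([0.50, 0.30, 0.15, 0.05])
Q = np.array([0.25, 0.25, 0.25, 0.25])
w = P / Q                                   # importance weights w(x) = P(x)/Q(x)

def d_alpha(P, Q, alpha):
    """Exponential of the Renyi divergence: [sum_x P^alpha / Q^(alpha-1)]^(1/(alpha-1))."""
    return (np.sum(P ** alpha / Q ** (alpha - 1))) ** (1.0 / (alpha - 1))

# Lemma 1: E[w] = 1, E[w^2] = d_2(P||Q), Var(w) = d_2(P||Q) - 1.
print(np.dot(Q, w))                         # 1.0
print(np.dot(Q, w ** 2), d_alpha(P, Q, 2))  # equal
print(np.dot(Q, w ** 2) - 1.0)              # variance of w

# Unbiasedness: E_{x~Q}[w(x) L_h(x)] = E_{x~P}[L_h(x)] = R(h).
L = np.array([0.9, 0.1, 0.4, 0.7])          # an arbitrary loss vector in [0, 1]
R_true = np.dot(P, L)
xs = rng.choice(len(Q), size=200_000, p=Q)  # training sample drawn from Q
R_w = np.mean(w[xs] * L[xs])                # importance-weighted empirical loss
print(R_true, R_w)                          # close for large sample size
```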

The following lemma gives a bound on the second moment of the importance-weighted loss.

Lemma 2. For all α ≥ 1 and all x ∈ X, the second moment of the importance-weighted loss can be bounded as follows:

    E_{x∼Q}[w²(x) L_h²(x)] ≤ d_{α+1}(P‖Q) R(h)^(1−1/α).    (4)

For α = 1, this becomes R(h)² ≤ E_{x∼Q}[w²(x) L_h²(x)] ≤ d_2(P‖Q).

Proof. The second moment can be bounded as follows:

    E_{x∼Q}[w²(x) L_h²(x)] = Σ_x Q(x) [P(x)/Q(x)]² L_h²(x) = Σ_x P(x) [P(x)/Q(x)] L_h²(x)
        ≤ [Σ_x P(x) (P(x)/Q(x))^α]^(1/α) [Σ_x P(x) L_h^(2α/(α−1))(x)]^(1−1/α)    (Hölder's inequality)
        ≤ d_{α+1}(P‖Q) [Σ_x P(x) L_h(x)]^(1−1/α) = d_{α+1}(P‖Q) R(h)^(1−1/α),

where the last inequality uses L_h^(2α/(α−1))(x) ≤ L_h(x), which holds since L_h(x) ∈ [0, 1] and 2α/(α−1) ≥ 1.

3 Learning Guarantees - Bounded Case

Note that sup_x w(x) = sup_x P(x)/Q(x) = d_∞(P‖Q). We first examine the case d_∞(P‖Q) < +∞ and use the notation M = d_∞(P‖Q). The following proposition then follows directly from Hoeffding's inequality.

Proposition 1 (single hypothesis). Fix h ∈ H. For any δ > 0, with probability at least 1 − δ,

    R(h) ≤ R_w(h) + M sqrt( log(2/δ) / (2m) ).

The upper bound M, though finite, can be quite large. The following theorem provides a more favorable bound as a function of the ratio M/m when any of the moments of w, d_α(P‖Q), is finite, which is the case when d_∞(P‖Q) < +∞ since the Rényi divergence is a non-decreasing function of α [23, 2], in particular:

    ∀α > 0,  d_α(P‖Q) ≤ d_∞(P‖Q).    (5)

Theorem 1 (single hypothesis). Fix h ∈ H. Then, for any α ≥ 1 and any δ > 0, with probability at least 1 − δ, the following bound holds for the importance weighting method:

    R(h) ≤ R_w(h) + 2M log(1/δ)/(3m) + sqrt( 2[d_{α+1}(P‖Q) R(h)^(1−1/α) − R(h)²] log(1/δ) / m ).    (6)

For α = 1, after further simplification, this gives

    R(h) ≤ R_w(h) + 2M log(1/δ)/(3m) + sqrt( 2 d_2(P‖Q) log(1/δ) / m ).

Proof. Let Z denote the random variable w(x) L_h(x) − R(h). Then |Z| ≤ M. By Lemma 2, the variance of the random variable Z can be bounded in terms of the Rényi divergence:

    σ²(Z) = E[w²(x) L_h²(x)] − R(h)² ≤ d_{α+1}(P‖Q) R(h)^(1−1/α) − R(h)².

Thus, by Bernstein's inequality [4], it follows that:

    Pr[ R(h) − R_w(h) > ε ] ≤ exp( − (m ε²/2) / (σ²(Z) + εM/3) ).

Setting δ to match this upper bound shows that with probability at least 1 − δ, the following bound holds for the importance weighting method:

    R(h) − R_w(h) ≤ M log(1/δ)/(3m) + sqrt( M² log²(1/δ)/(9m²) + 2σ²(Z) log(1/δ)/m ).

Using the sub-additivity of the square root leads to the simpler expression

    R(h) − R_w(h) ≤ 2M log(1/δ)/(3m) + sqrt( 2σ²(Z) log(1/δ)/m ).
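The contrast between Proposition 1 and Theorem 1 is easy to see numerically. The sketch below is my own illustration, not code from the paper; it uses the simplified α = 1 form of Theorem 1 as reconstructed above, and the values of M, d_2 and δ are arbitrary assumptions. It compares the Hoeffding-style deviation term, which scales with M, to the Bernstein-style term, which scales with sqrt(d_2(P‖Q)) up to a lower-order M/m term.

```python
import numpy as np

def hoeffding_term(M, m, delta):
    # Deviation term of Proposition 1: M * sqrt(log(2/delta) / (2m)).
    return M * np.sqrt(np.log(2.0 / delta) / (2.0 * m))

def bernstein_term(M, d2, m, delta):
    # Simplified deviation term of Theorem 1 (alpha = 1), as reconstructed above:
    # 2 M log(1/delta) / (3 m) + sqrt(2 d2 log(1/delta) / m).
    return (2.0 * M * np.log(1.0 / delta) / (3.0 * m)
            + np.sqrt(2.0 * d2 * np.log(1.0 / delta) / m))

# Illustrative regime: weights bounded by a large M but with a moderate second moment.
M, d2, delta = 1000.0, 5.0, 0.05
for m in (100, 1000, 10_000, 100_000):
    print(m, round(hoeffding_term(M, m, delta), 3), round(bernstein_term(M, d2, m, delta), 3))
```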

These results can be straightforwardly extended to general hypothesis sets. In particular, for a finite hypothesis set and for α = 1, the application of the union bound yields the following result.

Theorem 2 (finite hypothesis set). Let H be a finite hypothesis set. Then, for any δ > 0, with probability at least 1 − δ, the following bound holds for the importance weighting method:

    R(h) ≤ R_w(h) + 2M (log|H| + log(1/δ)) / (3m) + sqrt( 2 d_2(P‖Q) (log|H| + log(1/δ)) / m ).    (7)

For infinite hypothesis sets, a similar result can be shown straightforwardly using covering numbers instead of |H|, or a related measure based on samples of size m [20].

In the following proposition, we give a lower bound that further emphasizes the role of the Rényi divergence of the second order in the convergence of importance weighting in the bounded case.

Proposition 2 (lower bound). Assume that M < +∞ and σ²(w)/M² ≥ 1/m. Assume that H contains a hypothesis h_0 such that L_{h_0}(x) = 1 for all x. Then, there exists an absolute constant c > 0 such that

    Pr[ sup_{h∈H} |R(h) − R_w(h)| ≥ sqrt( (d_2(P‖Q) − 1) / (4m) ) ] ≥ c > 0.    (8)

Proof. Let σ_H = sup_{h∈H} σ(w L_h). If L_{h_0}(x) = 1 for all x ∈ X, then σ²(w L_{h_0}) = d_2(P‖Q) − 1 = σ²(w) = σ_H². The result then follows from a general theorem, Theorem 9, proven in the Appendix.

4 Learning Guarantees - Unbounded Case

The condition d_∞(P‖Q) < +∞ assumed in the previous section does not always hold, even in some natural cases, as illustrated by the following examples.

4.1 Examples

Assume that P and Q both follow a Gaussian distribution with standard deviations σ_P and σ_Q and with means µ_P and µ_Q:

    P(x) = 1/(sqrt(2π) σ_P) exp[ −(x−µ_P)²/(2σ_P²) ],    Q(x) = 1/(sqrt(2π) σ_Q) exp[ −(x−µ_Q)²/(2σ_Q²) ].

In that case,

    P(x)/Q(x) = (σ_Q/σ_P) exp[ (σ_P²(x−µ_Q)² − σ_Q²(x−µ_P)²) / (2σ_P²σ_Q²) ],

thus, even for σ_P = σ_Q and µ_P ≠ µ_Q, the importance weights are unbounded, d_∞(P‖Q) = sup_x P(x)/Q(x) = +∞, and the bound of Theorem 1 is not informative. The Rényi divergence of the second order is given by:

    d_2(P‖Q) = ∫ (σ_Q/σ_P) exp[ (σ_P²(x−µ_Q)² − σ_Q²(x−µ_P)²) / (2σ_P²σ_Q²) ] P(x) dx
             = σ_Q/(sqrt(2π) σ_P²) ∫ exp[ (σ_P²(x−µ_Q)² − 2σ_Q²(x−µ_P)²) / (2σ_P²σ_Q²) ] dx.

That is, for σ_Q > (sqrt(2)/2) σ_P, the variance of the importance weights is bounded. By the additivity property of the Rényi divergence, a similar situation holds for products and sums of such Gaussian distributions. Hence, in the rightmost example of Figure 1, the importance weights are unbounded, but their second moment is bounded. In the next section we provide learning guarantees even for this setting, in agreement with the results observed. For σ_Q = 0.3 σ_P, the same favorable guarantees do not hold, and, as illustrated in Figure 1, learning is significantly more difficult.
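For one-dimensional Gaussians the integral above can be evaluated in closed form by completing the square, which gives d_2(P‖Q) = σ_Q²/(σ_P sqrt(2σ_Q² − σ_P²)) · exp( (µ_P − µ_Q)²/(2σ_Q² − σ_P²) ) whenever 2σ_Q² > σ_P², and +∞ otherwise; this reproduces the condition σ_Q > (sqrt(2)/2) σ_P stated above. The sketch below is my own illustration (not code from the paper): it implements this formula and cross-checks it against a Monte Carlo estimate of E_{x∼Q}[w²(x)].

```python
import numpy as np

def d2_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """d_2(P||Q) = int P^2/Q dx for P = N(mu_p, sigma_p^2), Q = N(mu_q, sigma_q^2).
    Finite iff 2*sigma_q^2 > sigma_p^2, i.e. sigma_q > sigma_p / sqrt(2)."""
    s = 2.0 * sigma_q ** 2 - sigma_p ** 2
    if s <= 0:
        return np.inf
    return (sigma_q ** 2 / (sigma_p * np.sqrt(s))) * np.exp((mu_p - mu_q) ** 2 / s)

rng = np.random.default_rng(0)

def d2_monte_carlo(mu_p, sigma_p, mu_q, sigma_q, m=1_000_000):
    """Estimate d_2 = E_{x~Q}[w(x)^2] from a sample drawn according to Q."""
    x = rng.normal(mu_q, sigma_q, size=m)
    log_w = (np.log(sigma_q / sigma_p)
             - (x - mu_p) ** 2 / (2 * sigma_p ** 2)
             + (x - mu_q) ** 2 / (2 * sigma_q ** 2))
    return np.mean(np.exp(2 * log_w))

# Cross-check at sigma_Q = 0.9 sigma_P, where the Monte Carlo estimator behaves well.
print(d2_gaussians(0, 1.0, 0, 0.9), d2_monte_carlo(0, 1.0, 0, 0.9))
print(d2_gaussians(0, 1.0, 0, 0.75))  # finite: the favorable regime of Figure 1
print(d2_gaussians(0, 1.0, 0, 0.3))   # +inf: the unfavorable regime
```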

This example of Gaussians can further illustrate what can go wrong in importance weighting. Assume that µ_P = µ_Q = 0, σ_Q = 1 and σ_P = 10. One could have expected this to be an easy case for importance weighting, since sampling from Q provides useful information about P. The problem is, however, that a sample from Q will contain a very small number of points far from the mean (of either negative or positive label) and that these points will be assigned very large weights. For a sample of size m and σ_Q = 1, the expected value of an extreme point is sqrt(2 log m) (1 + o(1)) and its weight will be of the order of m^(1 − σ_Q²/σ_P²) = m^0.99. Therefore, a few extreme points will dominate all other weights and necessarily have a huge influence on the selection of a hypothesis by the learning algorithm.

Another related example is when σ_Q = σ_P = 1 and µ_P = 0. Let µ_Q ≠ 0 depend on the sample size m. If µ_Q is large enough compared to log(m), then, with high probability, all the weights will be negligible. This is especially problematic, since the estimate of the probability of any event would be negligible (in fact both an event and its complement). If we normalize the weights, this issue is overcome, but then, with high probability, the maximum weight dominates the sum of all other weights, reverting the situation back to that of the previous example.
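The first example above (µ_P = µ_Q = 0, σ_Q = 1, σ_P = 10) is easy to reproduce numerically. The following sketch is an illustration, not an experiment from the paper: it draws samples from Q, computes the importance weights, and reports the fraction of the total weight mass carried by the single largest weight, showing how a few extreme points dominate the weighted sample.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_p, sigma_q = 10.0, 1.0   # the example above: P = N(0, 10^2), Q = N(0, 1)

for m in (100, 1000, 10_000, 100_000):
    x = rng.normal(0.0, sigma_q, size=m)            # training sample drawn from Q
    log_w = (np.log(sigma_q / sigma_p)
             + x ** 2 / (2 * sigma_q ** 2) - x ** 2 / (2 * sigma_p ** 2))
    w = np.exp(log_w)                                # importance weights w(x) = P(x)/Q(x)
    frac_top = w.max() / w.sum()                     # share of the single largest weight
    print(f"m={m:6d}  max weight fraction = {frac_top:.2f}  "
          f"(extreme point near sqrt(2 log m) = {np.sqrt(2 * np.log(m)):.2f})")
```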

4.2 Importance weighting learning bounds - unbounded case

As in these examples, in practice, the importance weights are typically not bounded. However, we shall show that, remarkably, under the weak assumption that the second moment of the weights w, d_2(P‖Q), is bounded, generalization bounds can be given for this case as well. The following result relies on a general learning bound for unbounded loss functions proven in the Appendix (Corollary 1). We denote by Pdim(U) the pseudo-dimension of a real-valued function class U [21].

Theorem 3. Let H be a hypothesis set such that Pdim({L_h(x): h ∈ H}) = p < +∞. Assume that d_2(P‖Q) < +∞ and w(x) ≠ 0 for all x. Then, for any δ > 0, with probability at least 1 − δ, the following holds:

    R(h) ≤ R_w(h) + 2^(5/4) sqrt(d_2(P‖Q)) [ (p log(2em/p) + log(4/δ)) / m ]^(3/8).

Proof. Since d_2(P‖Q) < +∞, the second moment of w(x) L_h(x) is finite and upper bounded by d_2(P‖Q) (Lemma 2). Thus, by Corollary 1, we can write

    Pr[ sup_{h∈H} (R(h) − R_w(h)) / sqrt(d_2(P‖Q)) > ε ] ≤ 4 exp( m [ (p/m) log(2em/p) − ε^(8/3)/4^(5/3) ] ),

where p is the pseudo-dimension of the function class {w(x) L_h(x): h ∈ H}. We now show that p = Pdim({L_h(x): h ∈ H}). Let A = {x_1, ..., x_k} be a set shattered by {w(x) L_h(x): h ∈ H}. Then, there exist real numbers r_1, ..., r_k such that for any subset B ⊆ A there exists h ∈ H such that

    ∀i ∈ B, w(x_i) L_h(x_i) ≥ r_i    and    ∀i ∈ A − B, w(x_i) L_h(x_i) < r_i.    (9)

Since by assumption w(x_i) > 0 for all i ∈ [1, k], this implies that

    ∀i ∈ B, L_h(x_i) ≥ r_i/w(x_i)    and    ∀i ∈ A − B, L_h(x_i) < r_i/w(x_i).    (10)

Thus, {L_h(x): h ∈ H} shatters A with the witnesses s_i = r_i/w(x_i), i ∈ [1, k]. Using the same observations, it is straightforward to see that, conversely, any set shattered by {L_h(x): h ∈ H} is shattered by {w(x) L_h(x): h ∈ H}.

The convergence rate of the bound is slightly weaker (O(m^(−3/8))) than in the bounded case (O(m^(−1/2))). A faster convergence can be obtained, however, using the more precise bound of Theorem 8 at the expense of readability. The Rényi divergence d_2(P‖Q) seems to play a critical role in the bound and thus in the convergence of importance weighting in the unbounded case.

5 Alternative reweighting algorithms

The previous analysis can be generalized to the case of an arbitrary positive function u: X → R, u > 0. Let R_u(h) = (1/m) Σ_{i=1}^{m} u(x_i) L_h(x_i) and let Q̂ denote the empirical distribution.

Theorem 4. Let H be a hypothesis set such that Pdim({L_h(x): h ∈ H}) = p < +∞. Assume that 0 < E_{x∼Q}[u²(x)] < +∞ and u(x) ≠ 0 for all x. Then, for any δ > 0, with probability at least 1 − δ, the following holds:

    R(h) − R_u(h) ≤ E_{x∼Q}[(w(x) − u(x)) L_h(x)]
        + 2^(5/4) max( sqrt(E_{x∼Q}[u²(x) L_h²(x)]), sqrt(E_{x∼Q̂}[u²(x) L_h²(x)]) ) [ (p log(2em/p) + log(4/δ)) / m ]^(3/8).

Proof. Since R(h) = E_{x∼Q}[w(x) L_h(x)], we can write

    R(h) − R_u(h) = E_{x∼Q}[(w(x) − u(x)) L_h(x)] + E_{x∼Q}[u(x) L_h(x)] − R_u(h),

and thus R(h) − R_u(h) ≤ E_{x∼Q}[(w(x) − u(x)) L_h(x)] + |E_{x∼Q}[u(x) L_h(x)] − R_u(h)|. By Corollary 2 applied to the function u L_h, the term |E_{x∼Q}[u(x) L_h(x)] − R_u(h)| can be bounded by

    2^(5/4) max( sqrt(E_{x∼Q}[u²(x) L_h²(x)]), sqrt(E_{x∼Q̂}[u²(x) L_h²(x)]) ) [ (p log(2em/p) + log(4/δ)) / m ]^(3/8)

with probability 1 − δ, with p = Pdim({L_h(x): h ∈ H}), by a proof similar to that of Theorem 3.

The theorem suggests that functions u other than w can be used to reweight the cost of an error on each training point by minimizing the upper bound, which is a trade-off between the bias term E_{x∼Q}[(w(x) − u(x)) L_h(x)] and the second-moment term max(E_{x∼Q}[u²(x) L_h²(x)], E_{x∼Q̂}[u²(x) L_h²(x)]), where the coefficients are explicitly given. The function u can be selected from different families. Using an upper bound on these quantities that is independent of h, and a multiplicative bound of the form max(sqrt(E_{x∼Q}[u²]), sqrt(E_{x∼Q̂}[u²])) ≤ sqrt(E_{x∼Q̂}[u²]) (1 + O(1/sqrt(m))), leads to the following optimization problem:

    min_{u∈U}  E_{x∼Q̂}[|w(x) − u(x)|] + γ sqrt(E_{x∼Q̂}[u²(x)]),    (11)

where γ > 0 is a parameter controlling the trade-off between bias and variance minimization and where U is a family of possible weight functions out of which u is selected.

Here, we consider a family of functions U parameterized by the quantiles q of the weight function w. A function u_q ∈ U is then defined as follows: within each quantile, the value taken by u_q is the average of w over that quantile. For small values of γ, the bias term dominates, and very fine-grained quantiles minimize the bound of equation (11). For large values of γ, the variance term dominates and the bound is minimized by using just one quantile, corresponding to an even weighting of the training examples. Hence, by varying γ from small to large values, the algorithm interpolates between standard importance weighting, with just one example per quantile, and unweighted learning, where all examples are given the same weight.

[Figure 2: Comparison of the convergence of four different algorithms for the learning task of Figure 1 (ratio σ_Q/σ_P = 0.75): learning with equal weights for all examples (Unweighted), Importance weighting, using Quantiles to parameterize the function u, and Capping the largest weights.]

Figure 2 also shows the results of experiments for the learning task of Figure 1 using the algorithm defined by (11) with this family of functions. The optimal q is determined by 10-fold cross-validation. We see that a more rapid convergence can be obtained by using these weights compared to the standard importance weights w.
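The quantile-parameterized family U described above admits a very short implementation. The sketch below is my reading of the construction, not the authors' code: the weights are grouped into q quantile bins and each weight is replaced by the average of w over its bin, so that large q approaches standard importance weighting and q = 1 recovers unweighted learning; the empirical objective mirrors equation (11) as reconstructed above, and the capping family u_θ(x) = min(w(x), θ) discussed in the next paragraph is included for comparison. Selecting q by cross-validation, as in the experiments, is omitted.

```python
import numpy as np

def quantile_weights(w, q):
    """Replace each importance weight by the mean of w over its quantile bin.

    Large q essentially recovers the raw weights; q = 1 gives a single uniform weight."""
    w = np.asarray(w, dtype=float)
    edges = np.quantile(w, np.linspace(0.0, 1.0, q + 1))
    bins = np.clip(np.searchsorted(edges, w, side="right") - 1, 0, q - 1)
    u = np.empty_like(w)
    for b in range(q):
        mask = bins == b
        if mask.any():
            u[mask] = w[mask].mean()
    return u

def capped_weights(w, theta):
    """Thresholded weights u_theta(x) = min(w(x), theta)."""
    return np.minimum(np.asarray(w, dtype=float), theta)

def objective(w, u, gamma):
    """Empirical bias-variance criterion in the spirit of equation (11) above."""
    return np.mean(np.abs(w - u)) + gamma * np.sqrt(np.mean(u ** 2))

rng = np.random.default_rng(0)
w = np.exp(rng.normal(0.0, 2.0, size=5000))   # illustrative heavy-tailed weights
for q in (1, 4, 16, 5000):
    u = quantile_weights(w, q)
    print(q, round(objective(w, u, gamma=0.1), 3))
```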

Another natural family of functions is that of thresholded versions of the importance weights, {u_θ: θ > 0, ∀x ∈ X, u_θ(x) = min(w(x), θ)}. In fact, in practice, users often cap importance weights by choosing an arbitrary value θ. The advantage of this family is that, by definition, the weights are bounded. However, in some cases, larger weights could be critical to achieve a better performance. Figure 2 illustrates the performance of this approach. Compared to importance weighting, no change in performance is observed until the largest 1% of the weights are capped, in which case we only observe a performance degradation. We expect the thresholding to be less beneficial when the large weights reflect the true w and are not an artifact of estimation uncertainties.

6 Relationship between normalized and unnormalized weights

An alternative approach based on the weight function w(x) = P(x)/Q(x) consists of normalizing the weights. Thus, while in the unnormalized case the unweighted empirical error is replaced by

    (1/m) Σ_{i=1}^{m} w(x_i) L_h(x_i) = Σ_{i=1}^{m} (w(x_i)/m) L_h(x_i),

in the normalized case it is replaced by

    Σ_{i=1}^{m} (w(x_i)/W) L_h(x_i),    with W = Σ_{i=1}^{m} w(x_i).

We refer to ŵ(x) = w(x)/W as the normalized importance weight. An advantage of the normalized weights is that they are by definition bounded by one. However, the price to pay for this benefit is the fact that the weights are no longer unbiased. In fact, several issues similar to those we pointed out in Section 4 affect the normalized weights as well.

Here, we maintain the assumption that the second moment of the importance weights is bounded and analyze the relationship between normalized and unnormalized weights. We show that, under this assumption, normalized and unnormalized weights are in fact very close, with high probability. Observe that for any i ∈ [1, m],

    ŵ(x_i) − w(x_i)/m = w(x_i) [1/W − 1/m] = (w(x_i)/W) [1 − W/m].

Thus, since w(x_i) ≤ W, we can write |ŵ(x_i) − w(x_i)/m| ≤ |1 − W/m|. Since E_{x∼Q}[w(x)] = 1, we also have E_S[W/m] = (1/m) Σ_{k=1}^{m} E[w(x_k)] = 1. Thus, by Corollary 2, for any δ > 0, with probability at least 1 − δ, the following inequality holds:

    |W/m − 1| ≤ 2^(5/4) max{ sqrt(d_2(P‖Q)), sqrt(d̂_2(P‖Q)) } [ (log(2em) + log(4/δ)) / m ]^(3/8),

where d̂_2(P‖Q) denotes the empirical counterpart of d_2(P‖Q). This implies the same upper bound on |ŵ(x_i) − w(x_i)/m|, simultaneously for all i ∈ [1, m].
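The closeness of normalized and unnormalized weights is easy to observe numerically. The sketch below is an illustration (with Gaussian P and Q chosen, as an assumption, so that d_2(P‖Q) is finite): it reports |W/m − 1| and max_i |ŵ(x_i) − w(x_i)/m| for increasing sample sizes; as shown above, the second quantity is always bounded by the first.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_p, sigma_p, mu_q, sigma_q = 0.0, 1.0, 0.5, 0.9   # illustrative; d_2(P||Q) is finite here

def weights(x):
    log_w = (np.log(sigma_q / sigma_p)
             - (x - mu_p) ** 2 / (2 * sigma_p ** 2)
             + (x - mu_q) ** 2 / (2 * sigma_q ** 2))
    return np.exp(log_w)

for m in (100, 1000, 10_000, 100_000):
    x = rng.normal(mu_q, sigma_q, size=m)        # sample drawn from Q
    w = weights(x)
    W = w.sum()
    gap = np.max(np.abs(w / W - w / m))          # max_i |w_hat(x_i) - w(x_i)/m|
    print(f"m={m:6d}  |W/m - 1| = {abs(W / m - 1):.4f}  max_i gap = {gap:.4f}")
```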

7 Conclusion

We presented a series of theoretical results for importance weighting both in the bounded-weights case and in the more general unbounded case, under the assumption that the second moment of the weights is bounded. We also initiated a preliminary exploration of alternative weights and showed its benefits. A more systematic study of new algorithms based on these learning guarantees could lead to even more beneficial and practically useful results. Several of the learning guarantees we gave depend on the Rényi divergence of the distributions P and Q. Accurately estimating that quantity is thus critical and should motivate further studies of the convergence of its estimates from finite samples. Finally, our novel unbounded-loss learning bounds are of independent interest and could be useful in a variety of other contexts.

References

[1] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47:207-217, 1993.
[2] C. Arndt. Information Measures: Information and its Description in Science and Engineering. Signals and Communication Technology. Springer Verlag.
[3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, 2007.
[4] S. N. Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97:1-59, 1927.
[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, pages 49-56, New York, NY, USA, 2009.
[6] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In ICML, pages 81-88, 2007.
[7] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS 2007, 2008.
[8] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In ALT, 2008.
[9] S. Dasgupta and P. M. Long. Boosting with diverse base classifiers. In COLT, 2003.
[10] H. Daumé III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101-126, 2006.
[11] M. Dudík, R. Schapire, and S. J. Phillips. Correcting sample selection bias in maximum entropy density estimation. In NIPS, 2006.
[12] R. M. Dudley. A course on empirical processes. Lecture Notes in Math., 1097:2-142, 1984.
[13] R. M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 14(4), 1987.
[14] C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.
[15] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 100(1):78-150, 1992.
[16] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19, pages 601-608, 2006.
[17] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL, 2007.
[18] J. S. Liu. Monte Carlo strategies in scientific computing. Springer, 2001.
[19] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.
[20] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, Montréal, Canada, June 2009. Omnipress.
[21] D. Pollard. Convergence of Stochastic Processes. Springer, New York, 1984.
[22] D. Pollard. Asymptotics via empirical processes. Statistical Science, 4(4):341-366, 1989.
[23] A. Rényi. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, pages 547-561, 1960.
[24] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 2000.
[25] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2008.
[26] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[27] V. N. Vapnik. Estimation of Dependences Based on Empirical Data, 2nd ed. Springer, 2006.
[28] J. von Neumann. Various techniques used in connection with random digits. Monte Carlo methods. Nat. Bureau Standards, 12:36-38, 1951.
[29] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, 2003.
