arxiv: v4 [cs.lg] 4 Apr 2016

Size: px

Start display at page:

Download "arxiv: v4 [cs.lg] 4 Apr 2016"

Noah Brown
5 years ago
Views:

1 e-publication Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions arxiv:35796v4 cslg 4 Apr 6 Corinna Cortes Google Research, 76 Ninth Avenue, New York, NY Spencer Greenberg Courant Institute, 5 Mercer Street, New York, NY Mehryar Mohri Courant Institute and Google Research 5 Mercer Street, New York, NY Editor: TBD Abstract CORINNA@GOOGLECOM SPENCERG@CIMSNYUEDU MOHRI@CIMSNYUEDU We present an extensive analysis of relative deviation bounds, including detailed proofs of twosided inequalities and their iplications We also give detailed proofs of two-sided generalization bounds that hold in the general case of unbounded loss functions, under the assuption that a oent of the loss is bounded These bounds are useful in the analysis of iportance weighting and other learning tasks such as unbounded regression Keywords: Generalization bounds, learning theory, unbounded loss functions Introduction Most generalization bounds in learning theory hold only for bounded loss functions This includes standard VC-diension bounds Vapnik, 998, Radeacher coplexity Koltchinskii and Panchenko, ; Bartlett et al, a; Koltchinskii and Panchenko, ; Bartlett and Mendelson, or local Radeacher coplexity bounds Koltchinskii, 6; Bartlett et al, b, as well as ost other bounds based on other coplexity ters This assuption is typically unrelated to the statistical nature of the proble considered but it is convenient since when the loss functions are uniforly bounded, standard tools such as Hoeffding s inequality Hoeffding, 963; Azua, 967, McDiarid s inequality McDiarid, 989, or Talagrand s concentration inequality Talagrand, 994 apply There are however natural learning probles where the boundedness assuption does not hold This includes unbounded regression tasks where the target labels are not uniforly bounded, and a variety of applications such as saple bias correction Dudík et al, 6; Huang et al, 6; Cortes et al, 8; Sugiyaa et al, 8; Bickel et al, 7, doain adaptation Ben-David et al, 7; Blitzer et al, 8; Daué III and Marcu, 6; Jiang and Zhai, 7; Mansour et al, 9; Cortes and Mohri, 3, or the analysis of boosting Dasgupta and Long, 3, where the iportance weighting technique is used Cortes et al, It is therefore critical to derive learning guarantees that hold for these scenarios and the general case of unbounded loss functions When the class of functions is unbounded, a single function ay take arbitrarily large values with arbitrarily sall probabilities This is probably the ain challenge in deriving unifor conc 3 Corinna Cortes and Spencer Greenberg and Mehryar Mohri

2 CORTES, GREENBERG, AND MOHRI vergence bounds for unbounded losses This proble can be avoided by assuing the existence of an envelope, that is a single non-negative function with a finite expectation lying above the absolute value of the loss of every function in the hypothesis set Dudley, 984; Pollard, 984; Dudley, 987; Pollard, 989; Haussler, 99, an alternative assuption siilar to Hoeffding s inequality based on the expectation of a hyperbolic function, a quantity siilar to the oent-generating function, is used by Meir and Zhang 3 However, in any probles, eg, in the analysis of iportance weighting even for coon distributions, there exists no suitable envelope function Cortes et al, Instead, the second or soe other th-oent of the loss sees to play a critical role in the analysis Thus, instead, we will consider here the assuption that soe th-oent of the loss functions is bounded as in Vapnik 998, 6b This paper presents in detail two-sided generalization bounds for unbounded loss functions under the assuption that soe th-oent of the loss functions, >, is bounded The proof of these bounds akes use of relative deviation generalization bounds in binary classification, which we also prove and discuss in detail Much of the results and aterial we present is not novel and the paper has therefore a survey nature However, our presentation is otivated by the fact that the proofs given in the past for these generalization bounds were either incorrect or incoplete We now discuss in ore detail prior results and proofs One-side relative deviation bounds were first given by Vapnik 998, later iproved by a constant factor by Anthony and Shawe-Taylor 993 These publications and several others have all relied on a lower bound on the probability that a binoial rando variable of trials exceeds its expected value when the bias verifies p > This also later appears in Vapnik 6a and iplicitly in other publications referring to the relative deviations bounds of Vapnik 998 To the best of our knowledge, no actual proof of this inequality was ever given in the past in the achine learning literature before our recent work Greenberg and Mohri, 3 One attept was ade to prove this lea in the context of the analysis of soe generalization bounds Jaeger, 5, but unfortunately that proof is not sufficient to port the general case needed for the proof of the relative deviation bound of Vapnik 998 We present the proof of two-sided relative deviation bounds in detail using the recent results of Greenberg and Mohri 3 The two-sided versions we present, as well as several consequences of these bounds, appear in Anthony and Bartlett 999 However, we could not find a full proof of the two-sided bounds in any prior publication Our presentation shows that the proof of the other side of the inequality is not syetric and cannot be iediately obtained fro that of the first side inequality Additionally, this requires another proof related to the binoial distributions given by Greenberg and Mohri 3 Relative deviation bounds are very inforative guarantees in achine learning of independent interest, regardless of the key role they play in the proof of unbounded loss learning bounds They lead to sharper generalization bounds whose right-hand side is expressed as the interpolation of a O/ ter and a O/ ter that adits as a ultiplier the epirical error or the generalization error In particular, when the epirical error is zero, this leads to faster rate bounds We present in detail the proof of this type of results as well as that of several others of interest Anthony and Bartlett, 999 Let us ention that, in the for presented by Vapnik 998, relative deviation bounds suffer fro a discontinuity at zero zero denoinator, a proble that also affects inequalities for the other side and which sees not to have been rigorously treated by previous work Our proofs and results explicitly deal with this issue We use relative deviations bounds to give the full proofs of two-sided generalization bounds for unbounded losses with finite oents of order, both in the case < and the case >

3 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS One-sided generalization bounds for unbounded loss functions were first given by Vapnik 998, 6b under the sae assuptions and also using relative deviations The one-sided version of our bounds for the case < coincides with that of Vapnik, 998, 6b odulo a constant factor, but the proofs given by Vapnik in both books see to be incorrect The core coponent of our proof is based on a different technique using Hölder s inequality We also present soe ore explicit bounds for the case < by approxiating a coplex ter appearing in these bounds The one-sided version of the bounds for the case > are also due to Vapnik 998, 6b with siilar questions about the proofs In that case as well, we give detailed proofs using the Cauchy-Schwarz inequality in the ost general case where a positive constant is used in the denoinator to avoid the discontinuity at zero These learning bounds can be used directly in the analysis of unbounded loss functions as in the case of iportance weighting Cortes et al, The reainder of this paper is organized as follows In Section, we briefly introduce soe definitions and notation used in the next sections Section 3 presents in detail relative deviation bounds as well as several of their consequences Next, in Section 4 we present generalization bounds for unbounded loss functions under the assuption that the oent of order is bounded first in the case < Section 4, then in the case > Section 4 eliinaries We consider an input space X and an output space Y, which in the particular case of binary classification is Y = {, +} or Y = {, }, or a easurable subset of R in regression We denote by D a distribution over Z = X Y For a saple S of size drawn fro D, we will denote by D the corresponding epirical distribution, that is the distribution corresponding to drawing a point fro S uniforly at rando Throughout this paper, H denotes a hypothesis of functions apping fro X to Y The loss incurred by hypothesis h H at z Z is denoted by Lh, z L is assued to be non-negative, but not necessarily bounded We denote by Lh the expected loss or generalization error of a hypothesis h H and by L S h its epirical loss for a saple S: Lh = E Lh, z LS h = E Lh, z z D z D For any >, we also use the notation L h = E z D L h, z and L h = E z DL h, z for the th oents of the loss When the loss L coincides with the standard zero-one loss used in binary classification, we equivalently use the following notation Rh = E hx y RS h = E hx y z=x,y D z=x,y D In Vapnik, 998p4-6, stateent 537 cannot be derived fro assuption 535, contrary to what is claied by the author, and in general it does not hold: the first integral in 537 is restricted to a sub-doain and is thus saller than the integral of 535 Furtherore, the ain stateent claied in Section 56 is not valid In Vapnik, 6bp-, the author invokes the Lagrange ethod to show the ain inequality, but the proof steps are not atheatically justified Even with our best efforts, we could not justify soe of the steps and strongly believe the proof not to be correct In particular, the way function z is concluded to be equal to one over the first interval is suspicious and not rigorously justified Several of the coents we ade for the case < hold here as well In particular, the author s proof is not based on clear atheatical justifications Soe steps see suspicious and are not convincing, even with our best efforts to justify the 3

4 CORTES, GREENBERG, AND MOHRI We will soeties use the shorthand x to denote a saple of > points x,, x X For any hypothesis set H of functions apping X to Y = {, +} or Y = {, } and saple x, we denote by S H x the nuber of distinct dichotoies generated by H over that saple and by Π H the growth function: S H x = Card {hx,, hx : h H } 3 Π H = 3 Relative deviation bounds ax S Hx x X 4 In this section we prove a series of relative deviation learning bounds which we use in the next section for deriving generalization bounds for unbounded loss functions We will assue throughout the paper, as is coon in uch of learning theory, that each expression of the for is a easurable function, which is not guaranteed when H is not a countable set This assuption holds nevertheless in ost coon applications of achine learning We start with the proof of a syetrization lea Lea originally presented by Vapnik 998, which is used by Anthony and Shawe-Taylor 993 These publications and several others have all relied on a lower bound on the probability that a binoial rando variable of trials exceeds its expected value when the bias verifies p > To our knowledge, no rigorous proof of this fact was ever provided in the literature in the full generality needed The proof of this result was recently given by Greenberg and Mohri 3 Lea Greenberg and Mohri 3 Let X be a rando variable distributed according to the binoial distribution B, p with a positive integer the nuber of trials and p > the probability of success of each trial Then, the following inequality holds: X EX > 4, 5 where EX = p The lower bound is never reached but is approached asyptotically when = as p fro the right Our proof of Lea is ore concise than that of Vapnik 998 Furtherore, our stateent and proof handle the technical proble of discontinuity at zero ignored by previous authors The denoinator ay in general becoe zero, which would lead to an undefined result We resolve this issue by including an arbitrary positive constant τ in the denoinator in ost of our expressions For the proof of the following result, we will use the function F defined over, +, + by F : x, y x y x+y+ By Lea 9, F x, y is increasing in x and decreasing in y Lea Let < Assue that ɛ > Then, for any hypothesis set H and any τ >, the following holds: S D Rh R S h Rh + τ 4 S,S D R S h R S h R S h + R S h + 4

5 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS oof We give a concise version of the proof given by Vapnik, 998 We first show that the following iplication holds for any h H: Rh R S h RS h > Rh F R S h, R S h 6 Rh + τ The first condition can be equivalently rewritten as R S h < Rh ɛrh + τ, which iplies R S h < Rh ɛrh and ɛ < Rh, 7 since R S h Assue that the antecedent of the iplication 6 holds for h H Then, in view of the onotonicity properties of function F Lea 9, we can write: F R S h, R S h F Rh, Rh ɛrh RS h > Rh and st ineq of 7 = Rh Rh ɛrh Rh ɛrh + ɛrh nd ineq of 7 Rh ɛ + Rh = ɛ, ɛ Rh > which proves 6 Now, by definition of the reu, for any η >, there exists h H such that Rh R S h Rh R S h Rh + τ Rh + τ η 8 Using the definition of h and iplication 6, we can write R S h R S h S,S D R S h + R S h + R S h R S h S,S D S,S D = S D R S h + R S h + Rh R S h Rh + τ Rh R S h Rh + τ by def of R S h > Rh iplication 6 R S D S h > Rh independence We now show that this iplies the following inequality R S h R S h S,S D R S h + R S h + 4 S D Rh R S h Rh + τ + η, 9 5

6 CORTES, GREENBERG, AND MOHRI by distinguishing two cases If Rh, since ɛ >, by Theore the inequality S D R S h > Rh > 4 holds, which yields iediately 9 Otherwise we have Rh ɛ Then, by 7, the condition Rh R S h cannot hold for any saple S D Rh +τ Rh which by 8 iplies that the condition R S h + η cannot hold for any saple Rh+τ S D, in which case 9 trivially holds Now, since 9 holds for all η >, we can take the liit η and use the right-continuity of the cuulative distribution to obtain S,S D R S h R S h R S h + R S h + which copletes the proof of Lea 4 S D Rh R S h, Rh + τ Note that the factor of 4 in the stateent of lea can be odestly iproved by changing the condition assued fro ɛ > to ɛ > k for constant values of k > This leads to a slightly better lower bound on S D R S h > Rh, eg 3375 rather than 4 for k =, k at the expense of not covering cases where the nuber of saples is less than For soe values of k, eg k =, covering these cases is not needed for the proof of our ain theore Theore 5 though However, this does not see to siplify the critical task of proving a lower bound on S D R S h > Rh, that is the probability that a binoial rando variable B, p exceeds its expected value when p > k One ight hope that restricting the range of p in this way would help siplify the proof of a lower bound on the probability of a binoial exceeding its expected value Unfortunately, our analysis of this proble and proof Greenberg and Mohri, 3 suggest that this is not the case since the regie where p is sall sees to be the easiest one to analyze for this proble The result of Lea is a one-sided inequality The proof of a siilar result Lea 4 with the roles of Rh and R S h interchanged akes use of the following theore Lea 3 Greenberg and Mohri 3 Let X be a rando variable distributed according to the binoial distribution B, p with a positive integer and p < Then, the following inequality holds: X EX > 4, where EX = p The proof of the following lea Lea 4 is novel 3 While the general strategy of the proof is siilar to that of Lea, there are soe non-trivial differences due to the requireent p < of Theore 3 The proof is not syetric as shown by the details given below Lea 4 Let < Assue that ɛ > Then, for any hypothesis set H and any τ > the following holds: S D R S h Rh R S h + τ 4 S,S D 3 A version of this lea is stated in Boucheron et al, 5, but no proof is given ɛ R S h R S h R S h + R S h + 6

7 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS X p X p p p Figure : These plots depict X EX, the probability that a binoially distributed rando variable X exceeds its expectation, as a function of the trial success probability p The left plot shows only regions satisfying p > whereas the right plot shows only regions satisfying p > Each colored line corresponds to a different nuber of trials, =, 3,, 4 The dashed horizontal line at 4 represents the value of the lower bound used in the proof of lea oof oceeding in a way siilar to the proof of Lea, we first show that the following iplication holds for any h H: R S h Rh Rh R S h F R S h, R S h R S h + τ The first condition can be equivalently rewritten as Rh < R S h ɛ R S h+τ, which iplies Rh < R S h ɛ R S h and ɛ < RS h, since R S h Assue that the antecedent of the iplication holds for h H Then, in view of the onotonicity properties of function F Lea 9, we can write: F R S h, R S h F R S h, Rh Rh R S h F R S h, R S h ɛ R S h st ineq of = R S h R S h ɛ R S h R S h ɛ R S h + ɛrh nd ineq of Rh ɛ + Rh = ɛ, ɛ Rh > 7

8 CORTES, GREENBERG, AND MOHRI which proves For the application of Theore 3 to a hypothesis h, the condition Rh < is required Observe that this is iplied by the assuptions R S h ɛ and ɛ > : Rh < R S h ɛ R S h ɛ ɛ The rest of the proof proceeds nearly identically to that of Lea = ɛ < In the stateents of all the following results, the ter E x D S Hx can be replaced by the upper bound Π H to derive sipler expressions By Sauer s lea Sauer, 97; Vapnik and Chervonenkis, 97, the VC-diension d of the faily H can be further used to bound these quantities since Π H e d d for d The first inequality of the following theore was originally stated and proven by Vapnik 998, 6b, later by Anthony and Shawe-Taylor 993 in the special case = with a soewhat ore favorable constant, in both cases odulo the incoplete proof of the syetrization and the technical issue related to the denoinator taking the value zero, as already pointed out The second inequality of the theore and its proof are novel Our proofs benefit fro the iproved analysis of Anthony and Shawe-Taylor 993 Theore 5 For any hypothesis set H of functions apping a set X to {, }, and any fixed < and τ >, the following two inequalities hold: Rh R S h S D Rh + τ R S h Rh S D R S h + τ 4 ES H x exp 4 ES H x exp + ɛ + ɛ oof We first consider the case where ɛ then write 4 ES H x exp + ɛ, which is not covered by Lea We can 4 ES H x exp >, + for < Thus, the bounds of the theore hold trivially in that case On the other hand, when, we can apply Lea and Lea 4 Therefore, to prove theore 5, it is sufficient to ɛ work with the syetrized expression with our original expressions Rh R S h Rh+τ RS h R S h, rather than working directly R S h+ R S h+ and Rh RS h Rh+τ To upper bound the probability that the syetrized expression is larger than ɛ, we begin by introducing a vector of Radeacher rando variables σ = σ, σ,, σ, where the σ i are independent, identically distributed rando variables each equally likely to take the value + or Using the shorthand 8

9 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS x for x,, x, we can then write S,S D R S h R S h R S h + R S h + = x D = x D,σ = E x D Now, for a fixed x, we have E σ =, thus, by Hoeffding s inequality, we can write i= hx +i hx i i= hx +i + hx i + σ i= σ ihx +i hx i σ i= hx +i + hx i i= σ ihx +i hx i i= hx +i + hx i + i= σ ihx +i hx i i= hx +i + hx i + i= σ ihx +i hx i i= hx +i+hx i + x x i= exp hx +i + hx i + ɛ i= hx +i hx i exp + i= hx +i + hx i ɛ i= hx +i hx i + Since the variables hx i, i,, take values in {, }, we can write hx +i hx i = i= hx +i + hx i hx +i hx i i= hx +i + hx i hx +i + hx i i= i=, where the last inequality holds since and the su is either zero or greater than or equal to one In view of this identity, we can write i= σ ihx +i hx i σ x ɛ exp i= hx +i + hx i We note now that the reu over h H in the left-hand side expression in the stateent of our theore need not be over all hypothesis in H: without changing its value, we can replace H with a saller hypothesis set where only one hypothesis reains for each unique binary vector + 9

10 CORTES, GREENBERG, AND MOHRI hx, hx,, hx The nuber of such hypotheses is S H x, thus, by the union bound, the following holds: σ i= σ ihx +i hx i i= hx +i + hx i x S H x exp + ɛ The result follows by taking expectations with respect to x and applying Lea and Lea 4 respectively Corollary 6 Let < and let H be a hypothesis set of functions apping X to {, } Then, for any δ >, each of the following two inequalities holds with probability at least δ: Rh R S h + Rh Rh R S h + Rh log ES H x + log 4 δ log ES H x + log 4 δ oof The result follows directly fro Theore 5 by setting δ to atch the upper bounds and taking the liit τ For =, the inequalities becoe Rh R S h Rh log ES Hx + log 4 δ R S h Rh 3 Rh log ES Hx + log 4 δ, 4 with the failiar dependency O log/d /d The advantage of these relative deviations is clear For sall values of Rh or Rh these inequalities provide tighter guarantees than standard generalization bounds Solving the corresponding second-degree inequalities in Rh or Rh leads to the following results Corollary 7 Let < and let H be a hypothesis set of functions apping X to {, } Then, for any δ >, each of the following two inequalities holds with probability at least δ: Rh R S h + R S h Rh + R S h log ES Hx + log 4 δ Rh log ES Hx + log 4 δ + 4 log ES Hx + log 4 δ + 4 log ES Hx + log 4 δ

11 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS oof The second-degree inequality corresponding to 3 can be written as Rh Rhu RS h, with u = gives: log ES H x +log 4 δ, and iplies Rh u + u + R S h Squaring both sides Rh u + u + R S h = u + u u + R S h + u + R S h u + u u + R S h + u + R S h = 4u + u R S h + R S h The second inequality can be proven in the sae way fro 4 The learning bounds of the corollary ake clear the presence of two ters: a ter in O/ and a ter in O/ which adits as a factor R S h or Rh and which for sall values of these ters can be ore favorable than standard bounds Theore 5 can also be used to prove the following relative deviation bounds The following theore and its proof assuing the result of Theore 5 were given by Anthony and Bartlett 999 Theore 8 For all < ɛ <, ν >, the following inequalities hold: S D S D Rh R S h Rh + Rh + ν R S h Rh Rh + Rh + ν νɛ 4 ES H x exp ɛ νɛ 4 ES H x exp ɛ oof We prove the first stateent, the proof of the second stateent is identical odulo the perutation of the roles of Rh and R S h To do so, it suffices to deterine ɛ > such that S D Rh R S h Rh + Rh + ν S D Rh R S h, Rh + τ since we can then apply theore 5 with = to bound the right-hand side and take the liit as τ to eliinate the τ-dependence To find such a choice of ɛ, we begin by observing that for any h H, Rh R S h Rh + Rh + ν ɛ Rh + ɛ ɛ R S h + ɛ ν 5 ɛ Assue now that Rh R S h ɛ for soe ɛ >, which is equivalent to Rh R S h + Rh+τ ɛ Rh + τ We will prove that this iplies 5 To show that, we distinguish two cases, Rh + τ µ ɛ and Rh + τ > µ ɛ, with µ > The first case iplies the following: Rh + τ µ ɛ Rh R S h + ɛ µ ɛ Rh R S h + µɛ

12 CORTES, GREENBERG, AND MOHRI The second case Rh + τ > µ ɛ is equivalent to ɛ < Rh+τ µ and iplies ɛ < Rh + τ µ Rh R S h + Rh + τ µ Rh µ µ R S h + τ µ Observe now that since µ µ >, both cases iply Rh µ µ R S h + τ µ + µɛ 6 We now choose ɛ and µ to ake 6 atch 5 by setting µ which gives: µ = + ɛ ɛ = ɛ ν τ ɛ ɛ With these choices, the following inequality holds for all h H: which concludes the proof µ = +ɛ ɛ and Rh R S h Rh + τ ɛ Rh R S h Rh + Rh + ν ɛ, The following corollary was given by Anthony and Bartlett 999 Corollary 9 For all ɛ >, v >, the following inequality holds: S D oof Observe that Rh + v R S h 4 ES H x exp τ µ + µɛ = ɛ ɛ ν, vɛ 4 + v Rh R S h Rh + Rh + ν Rh R S h > Rh + Rh + νɛ Rh > + ɛ ɛ Rh + ɛν ɛ To derive the stateent of the corollary fro that of Theore 8, we identify +ɛ ɛ with + v, which gives ɛ + v = v, that is we choose ɛ = v ɛν +v, and siilarly identify ɛ with ɛ, that is ɛ = v +v ν = v ν, thus we choose ν = v ɛ With these choices of ɛ and ν, the coefficient +v vɛ in the exponential appearing in the bounds of Theore 8 can be rewritten as follows: ɛ v v +v 4v+4 = ɛ +v v v 4v+ = ɛ v 4v+, which concludes the proof ɛ = The result of Corollary 9 is rearkable since it shows that a fast convergence rate of O/ can be achieved provided that we settle for a slightly larger value than the epirical error, one differing by a fixed factor + v The following is an iediate corollary when R S h =, where we take v Corollary For all ɛ >, v >, the following inequality holds: h H : Rh R S h = S D 4 ES H x exp This is the failiar fast rate convergence result for separable cases ɛ 4

13 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS 4 Generalization bounds for unbounded losses In this section we will ake use of the relative deviation bounds given in the previous section to prove generalization bounds for unbounded loss functions under the assuption that the oent of order of the loss is bounded We will start with the case < and then ove on to considering the case when > As already indicated earlier, the one-sided version of the results presented in this section were given by Vapnik 998 with slightly different constants, but the proofs do not see to be correct or coplete The second stateents in all these results other side of the inequality are new Our proofs for both sets of results are new 4 Bounded oent with < Our first theore reduces the proble of deriving a relative deviation bound for an unbounded loss function with L h = E z D Lh, z < + for all h H, to that of relative deviation bound for binary classification To siplify the presentation of the results, in what follows we will use the shorthand Lh, z > t instead of z D Lh, z > t, and siilarly Lh, z > t instead of z DLh, z > t Theore Let <, < ɛ, and < τ < ɛ For any loss function L not necessarily bounded and hypothesis set H such that L h < + for all h H, the following two inequalities hold: Lh L S h L h + τ > Γ, ɛ ɛ Lh L S h L h + τ > Γ, ɛ ɛ,t R,t R with Γ, ɛ = + τ + + τ Lh, z > t Lh, z > t Lh, z > t + τ Lh, z > t Lh, z > t Lh, z > t + τ + log/ɛ, oof We prove the first stateent The second stateent can be shown in a very siilar way Fix < and ɛ > and assue that for any h H and t, the following holds: Lh, z > t Lh, z > t ɛ 7 Lh, z > t + τ We show that this iplies that for any h H, Lh L S h L h+τ Lebesgue integral, we can write and, siilarly, L h = L h = Lh = Lh = E Lh, z = z D E Lh, z = z D L h, z > t dt = 3 Γ, ɛɛ By the properties of the Lh, z > t dt Lh, z > t dt, t Lh, z > t dt

14 CORTES, GREENBERG, AND MOHRI In what follows, we use the notation I = L h + τ Let t = si and t = t ɛ for s > To bound Lh Lh, we siply bound Lh, z > t Lh, z > t by Lh, z > t for large values of t, that is t > t, and use inequality 7 for saller values of t: Lh Lh = t Lh, z > t Lh, z > t dt ɛ Lh, z > t + τ dt + t Lh, z > t dt For relatively sall values of t, Lh, z > t is close to one Thus, we can write Lh Lh = t ɛ t + τ dt + ftgt dt, t ɛ Lh, z > t + τ dt + t Lh, z > tdt with γ I ɛ + τ if t t ft = γ t Lh, z > t + τ ɛ if t < t t γ t Lh, z > t ɛ if t < t gt = γ I γ t Lh,z>t γ t if t t if t < t t ɛ if t < t, where γ, γ are positive paraeters that we shall select later Now, since >, by Hölder s inequality, Lh Lh ft + dt gt dt The first integral on the right-hand side can be bounded as follows: ft dt = t + τγ I ɛ dt + γ ɛ τ t t t dt + γ + τγ I t ɛ + γ ɛ τt t + γ ɛ I γ + τs + γ + s /ɛ τɛ I γ + τs + γ + s τ ɛ I t t Lh, z > tɛ dt 4

15 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS Since t /t = /ɛ, the second one can be coputed and bounded following gt dt = t t t γ dt + γ I = s + γ γ s + γ γ s + γ γ = s + γ γ dt + t + Cobining the bounds obtained for these integrals yields directly Lh Lh γ + τs + γ + s τ ɛ I = γ + τs + γ + s τ s γ s γ + γ Lh, z > t dt t γ ɛ t log ɛ + t Lh, z > t dt t γ ɛ t log ɛ + t Lh, z > t dt t γ ɛ I t log ɛ + γ ɛ s I ɛ log ɛ + s + γ log ɛ + s log ɛ + ɛi s Observe that the expression on the right-hand side can be rewritten as u v ɛi where the vectors u and v are defined by u = γ + τ s, γ + s τ and v = v, v = s γ, γ The inner product u v does not depend on γ log ɛ + s and γ and by the properties of Hölder s inequality can be reached when u and the vector v = v, v are collinear γ and γ can be chosen so that detu, v =, since this condition can be rewritten as s γ + τ log γ ɛ + s + s τ γ =, 8 s γ or equivalently, γ γ Thus, for such values of γ and γ, the following inequality holds: log ɛ + + s τ = 9 s Lh Lh u v ɛi = fs ɛi, 5

16 CORTES, GREENBERG, AND MOHRI with fs = + τ s + + s τ = + τ s + + s τ log ɛ + s log ɛ + s Setting s = yields the stateent of the theore The next corollary follows iediately by upper bounding the right-hand side of the learning bounds of theore using theore 5 It provides learning bounds for unbounded loss functions in ters of the growth functions in the case < Corollary Let ɛ <, <, and < τ < ɛ For any loss function L not necessarily bounded and hypothesis set H such that L h < + for all h H, the following inequalities hold: Lh Lh L h + τ > Γ, ɛɛ 4 ES Q z exp Lh Lh L h + τ > Γ, ɛɛ 4 ES Q z exp + ɛ + ɛ, where Q is the set of functions Q = {z Lh,z>t h H, t R}, and Γ, ɛ = + τ + + τ + log/ɛ The following corollary gives the explicit result for = Corollary 3 Let ɛ < and < τ < ɛ 4 For any loss function L not necessarily bounded and hypothesis set H such that L h < + for all h H, the following inequalities hold: with Γ, ɛ = +τ h H, t R} + Lh Lh L h + τ > Γ, ɛɛ 4 ES Q z exp Lh Lh L h + τ > Γ, ɛɛ 4 ES Q z exp ɛ 4 ɛ + 4 τ + log ɛ and Q the set of functions Q = {z Lh,z>t 4, Corollary 4 Let L be a loss function not necessarily bounded and H a hypothesis set such that L h < + for all h H Then, for any δ >, with probability at least δ, each of the 6

17 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS following inequalities holds for all h H: Lh L S h + L h log ES Q z + log δ Γ, log ES Q z L S h Lh + L + log δ h Γ, log ES Q z + log δ log ES Q z + log δ, where Q is the set of functions Q = {z Lh,z>t h H, t R} and Γ, ɛ = + + log ɛ oof For any ɛ >, let fɛ = Γ, ɛɛ Then, by Corollary 3, Lh Lh f L h + τ 4 ES Q z ɛ exp 4 Setting the right-hand side to ɛ and using inversion yields iediately the first inequality The second inequality is proven in the sae way Observe that, odulo the factors in Γ, the bounds of the corollary adit the standard / dependency and that the factors in Γ are only logarithic in 4 Bounded oent with > This section gives two-sided generalization bounds for unbounded losses with finite oents of order, with > As for the case < <, the one-sided version of our bounds coincides with that of Vapnik 998, 6b odulo a constant factor, but, here again, the proofs given by Vapnik in both books see to be incorrect oposition 5 Let > For any loss function L not necessarily bounded and hypothesis set H such that < L h < + for all h H, the following two inequalities hold: Lh, z > tdt Ψ L h and Lh, z > tdt Ψ L h, where Ψ = oof We prove the first inequality The second can be proven in a very siilar way Fix > and h H As in the proof of Theore, we bound Lh, z > t by for t close to, say t t for soe t > that we shall later deterine We can write t Lh, z > tdt dt + with functions f and g defined as follows: ft = {γi if t t t Lh, z > t if t < t t Lh, z > tdt = 7 gt = γi t ftgtdt, if t t if t < t,

18 CORTES, GREENBERG, AND MOHRI where I = L h and where γ is a positive paraeter that we shall select later By the Cauchy- Schwarz inequality, Thus, we can write Lh, z > tdt Lh, z > tdt ft + dt gt dt γ I t + t γ I t + I t t Lh, z > tdt γ I t γ I + t + dt t t Introducing t with t = I / t leads to Lh, z > tdt γ I t + I γ t + We now seek to iniize the expression γ t + t γ I t γ + t + t I I t +, first as a function of γ t γ This expression can be viewed as the product of the nors of the vectors u = γt, and v = t γ,, with a constant inner product not depending on γ Thus, by the properties of t the Cauchy-Schwarz inequality, it is iniized for collinear vectors and in that case equals their inner product: u v = t + t Differentiating this last expression with respect to t and setting the result to zero gives the iniizing value of t : = For that value of t, u v = which concludes the proof + t = =, 8

19 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS Theore 6 Let >, < ɛ, and < τ ɛ Then, for any loss function L not necessarily bounded and hypothesis set H such that L h < + and L h < + for all h H, the following two inequalities hold: Lh Lh L h + τ > Λɛ,t R Lh, z > t Lh, z > t Lh, z > t + τ Lh Lh L h + τ > Λɛ,t R Lh, z > t Lh, z > t Lh, z > t + τ, where Λ = + τ oof We prove the first stateent since the second one can be proven in a very siilar way Lh,z>t Lh,z>t Assue that h,t ɛ Fix h H, let J = Lh, z > t dt Lh,z>t+τ and ν = L h By Markov s inequality, for any t >, Lh, z > t = L h, z > t L h t = ν t Using this inequality, for any t >, we can write Lh Lh = = ɛ ɛ t t t Lh, z > t Lh, z > t dt Lh, z > t Lh, z > t dt + Lh, z > t + τ dt + Lh, z > t dt Lh, z > t + τ dt + ɛj + ɛ ν τt + t t t t ν t dt Choosing t to iniize the right-hand side yields t = ν ɛ and gives τ Lh Lh ɛj + ν ɛ τ Lh, z > t Lh, z > t dt Since τ ɛ, ɛ τ = ɛτ oposition 5, the following holds: τ ɛɛ τ = ɛτ Thus, by Lh Lh L h + τ ɛψ ν ν + τ + ɛτ ν ν + τ ɛψ + ɛτ, which concludes the proof Cobining Theore 6 with Theore 5 leads iediately to the following two results 9

20 CORTES, GREENBERG, AND MOHRI Corollary 7 Let >, < ɛ, and < τ ɛ Then, for any loss function L not necessarily bounded and hypothesis set H such that L h < + and L h < + for all h H, the following two inequalities hold: Lh Lh ɛ L h + τ > Λɛ 4 ES Q z exp 4 Lh Lh ɛ > Λɛ 4 ES Q z exp, 4 L h + τ where Λ = Lh,z>t h H, t R} + τ and where Q is the set of functions Q = {z In the following result, PdiG denotes the pseudo-diension of a faily of real-valued functions G Pollard, 984, 989; Vapnik, 998, which coincides with the VC-diension of the corresponding thresholded functions: PdiG = VCdi {x, t gx t> : g G } Corollary 8 Let >, < ɛ Let L be a loss function not necessarily bounded and H a hypothesis set such that L h < + for all h H, and d = Pdi{z Lh, z h H} < + Then, for any δ >, with probability at least δ, each of the following inequalities holds for all h H: Lh Lh + Λ d log e d + log 4 δ L h d log Lh Lh + Λ L e d + log 4 δ h where Λ = 5 Conclusion We presented a series of results for relative deviation bounds used to prove generalization bounds for unbounded loss functions These learning bounds can be used in a variety of applications to deal with the ore general unbounded case The relative deviation bounds are of independent interest and can be further used for a sharper analysis of guarantees in binary classification and other tasks References M Anthony and J Shawe-Taylor A result of Vapnik with applications Discrete Applied Matheatics, 47:7 7, 993 Martin Anthony and Peter L Bartlett Neural Network Learning: Theoretical Foundations Cabridge University ess, 999

21 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS Kazuoki Azua Weighted sus of certain dependent rando variables Tohoku Matheatical Journal, 93: , 967 Peter L Bartlett and Shahar Mendelson Radeacher and Gaussian coplexities: Risk bounds and structural results Journal of Machine Learning Research, 3, Peter L Bartlett, Stéphane Boucheron, and Gábor Lugosi Model selection and error estiation Machine Learning, 48:85 3, Septeber a Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson Localized Radeacher coplexities In COLT, volue 375, pages Springer-Verlag, b S Ben-David, J Blitzer, K Craer, and F Pereira Analysis of representations for doain adaptation NIPS, 7 Steffen Bickel, Michael Brückner, and Tobias Scheffer Discriinative learning for differing training and test distributions In ICML, pages 8 88, 7 J Blitzer, K Craer, A Kulesza, F Pereira, and J Wortan Learning bounds for doain adaptation NIPS 7, 8 Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi Theory of classification: a survey of recent advances ESAIM: obability and Statistics, 9:33 375, 5 Corinna Cortes and Mehryar Mohri Doain adaptation and saple bias correction theory and algorith for regression Theoretical Coputer Science, 9474, 3 Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostaizadeh Saple selection bias correction theory In ALT, 8 Corinna Cortes, Yishay Mansour, and Mehryar Mohri Learning bounds for iportance weighting In Advances in Neural Inforation ocessing Systes NIPS, Vancouver, Canada, MIT ess Sanjoy Dasgupta and Philip M Long Boosting with diverse base classifiers In COLT, 3 Hal Daué III and Daniel Marcu Doain adaptation for statistical classifiers Journal of Artificial Intelligence Research, 6: 6, 6 Miroslav Dudík, Robert E Schapire, and Steven J Phillips Correcting saple selection bias in axiu entropy density estiation In NIPS, 6 R M Dudley A course on epirical processes Lecture Notes in Matheatics, 97: 4, 984 R M Dudley Universal Donsker classes and etric entropy Annals of obability, 44:36 36, 987 S Greenberg and M Mohri Tight lower bound on the probability of a binoial exceeding its expectation Technical Report 3-957, Courant Institute, New York, New York, 3 David Haussler Decision theoretic generalizations of the PAC odel for neural net and other learning applications Inf Coput, :78 5, 99

22 CORTES, GREENBERG, AND MOHRI Wassily Hoeffding obability inequalities for sus of bounded rando variables Journal of the Aerican Statistical Association, 583:3 3, 963 Jiayuan Huang, Alexander J Sola, Arthur Gretton, Karsten M Borgwardt, and Bernhard Schölkopf Correcting saple selection bias by unlabeled data In NIPS, volue 9, pages 6 68, 6 Savina Andonova Jaeger Generalization bounds and coplexities based on sparsity and clustering for convex cobinations of functions fro rando classes Journal of Machine Learning Research, 6:37 34, 5 Jing Jiang and ChengXiang Zhai Instance Weighting for Doain Adaptation in NLP In ACL, 7 V Koltchinskii Local radeacher coplexities and oracle inequalities in risk iniization Annals of Statistics, 346, 6 Vladiir Koltchinskii and Ditry Panchenko Radeacher processes and bounding the risk of function learning In High Diensional obability II, pages Birkhäuser, Vladir Koltchinskii and Ditry Panchenko Epirical argin distributions and bounding the generalization error of cobined classifiers Annals of Statistics, 3, Yishay Mansour, Mehryar Mohri, and Afshin Rostaizadeh Doain adaptation: Learning bounds and algoriths In COLT, 9 Colin McDiarid On the ethod of bounded differences Surveys in Cobinatorics, 4: 48 88, 989 Ron Meir and Tong Zhang Generalization Error Bounds for Bayesian Mixture Algoriths Journal of Machine Learning Research, 4:839 86, 3 David Pollard Convergence of Stochastic ocessess Springer, New York, 984 David Pollard Asyptotics via epirical processes Statistical Science, 44:34 366, 989 Norbert Sauer On the density of failies of sets Journal of Cobinatorial Theory, Series A, 3 :45 47, 97 M Sugiyaa, S Nakajia, H Kashia, P von Bünau, and M Kawanabe Direct iportance estiation with odel selection and its application to covariate shift adaptation In NIPS, 8 Michael Talagrand Sharper bounds for gaussian and epirical processes Annals of obability, :8 76, 994 Vladiir N Vapnik Statistical Learning Theory John Wiley & Sons, 998 Vladiir N Vapnik Estiation of Dependences Based on Epirical Data Springer-Verlag, 6a Vladiir N Vapnik Estiation of Dependences Based on Epirical Data, second edition Springer, Berlin, 6b

23 RELATIVE DEVIATION AND GENERALIZATION WITH UNBOUNDED LOSS FUNCTIONS Vladiir N Vapnik and Alexey Chervonenkis On the unifor convergence of relative frequencies of events to their probabilities Theory of obability and Its Applications, 6:64, 97 Appendix A Leas in port of Section 3 Lea 9 Let < and for any η >, let f :, +, + R be the function defined by f : x, y x y Then, f is a strictly increasing function of x and a strictly x+y+η decreasing function of y oof f is differentiable over its doain of definition and for all x, y, +, +, f x f y x + y + η x, y = x + y + η x, y = x y x + y + η x y x + y + η x + y + η x + y + η = x + + y + η > x + y + η + = + x + y + η x + y + η + < 3

Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions

Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions Corinna Cortes Spencer Greenberg Mehryar Mohri January, 9 Abstract We present an extensive analysis of relative deviation