arXiv: v2 [stat.ML] 23 Feb 2016


Permutational Rademacher Complexity: A New Complexity Measure for Transductive Learning

Ilya Tolstikhin (1), Nikita Zhivotovskiy (2, 3), and Gilles Blanchard (4)

(1) Max-Planck-Institute for Intelligent Systems, Tübingen, Germany, ilya@tuebingen.mpg.de
(2) Moscow Institute of Physics and Technology, Moscow, Russia
(3) Institute for Information Transmission Problems, Moscow, Russia, nikita.zhivotovskiy@phystech.edu
(4) Department of Mathematics, Universität Potsdam, Potsdam, Germany, gilles.blanchard@math.uni-potsdam.de

Abstract. Transductive learning considers situations when a learner observes $m$ labelled training points and $u$ unlabelled test points with the final goal of giving correct answers for the test points. This paper introduces a new complexity measure for transductive learning called Permutational Rademacher Complexity (PRC) and studies its properties. A novel symmetrization inequality is proved, which shows that PRC provides a tighter control over expected suprema of empirical processes compared to what happens in the standard i.i.d. setting. A number of comparison results are also provided, which show the relation between PRC and other popular complexity measures used in statistical learning theory, including Rademacher complexity and Transductive Rademacher Complexity (TRC). We argue that PRC is a more suitable complexity measure for transductive learning. Finally, these results are combined with a standard concentration argument to provide novel data-dependent risk bounds for transductive learning.

Keywords: Transductive Learning, Rademacher Complexity, Statistical Learning Theory, Empirical Processes, Concentration Inequalities

1 Introduction

Rademacher complexities ([14, 2]) play an important role in the widely used concentration-based approach to statistical learning theory [4], which is closely related to the analysis of empirical processes [21]. They measure the complexity of function classes and provide data-dependent risk bounds in the standard i.i.d. framework of inductive learning, thanks to symmetrization and concentration inequalities. Recently, a number of attempts were made to apply this machinery also to the transductive learning setting. In particular, the authors of [10] introduced a notion of transductive Rademacher complexity and provided an extensive study of its properties, as well as general transductive risk bounds based on this new complexity measure.

In transductive learning, a learner observes $m$ labelled training points and $u$ unlabelled test points. The goal is to give correct answers on the test points. Transductive learning naturally appears in many modern large-scale applications, including text mining, recommender systems, and computer vision, where often the objects to be classified are available beforehand. There are two different settings of transductive learning, defined by V. Vapnik in his book [22, Chap. 8]. The first one assumes that all the objects from the training and test sets are generated i.i.d. from an unknown distribution $P$. The second one is distribution free, and it assumes that the training and test sets are realized by a uniform and random partition of a fixed and finite general population of cardinality $N := m + u$ into two disjoint subsets of cardinalities $m$ and $u$; moreover, no assumptions are made regarding the underlying source of this general population. The second setting has gained much attention^5 ([22, 9, 7, 10, 8], and [20]), probably due to the fact that any upper risk bound for this setting directly implies a risk bound also for the first setting [22, Theorem 8.1]. In essence, the second setting studies uniform deviations of risks computed on two disjoint finite samples. Following Vapnik's discussion in [6, p. 458], we would also like to emphasize that the second setting of transductive learning naturally appears as a middle step in proofs of the standard inductive risk bounds, as a result of symmetrization or the so-called double-sample trick. This way better transductive risk bounds also translate into better inductive ones.

^5 For an extensive overview of transductive risk bounds we refer the reader to [18].

An important difference between the two settings discussed above lies in the fact that the elements of the training set in the second setting are interdependent, because they are sampled uniformly without replacement from the general population. As a result, the standard techniques developed for inductive learning, including the concentration and Rademacher complexities mentioned in the beginning, can not be applied in this setting, since they are heavily based on the i.i.d. assumption. Therefore, it is important to study empirical processes in the setting of sampling without replacement.

Previous work. A large step in this direction was made in [10], where the authors presented a version of McDiarmid's bounded difference inequality [5] for sampling without replacement together with the Transductive Rademacher Complexity (TRC). As a main application the authors derived an upper bound on the binary test error of a transductive learning algorithm in terms of TRC. However, the analysis of [10] has a number of shortcomings. Most importantly, TRC depends on the unknown labels of the test set. In order to obtain computable risk bounds, the authors resorted to the contraction inequality [15], which is known to be a loose step [17], since it destroys any dependence on the labels.

Another line of work was presented in [20], where variants of Talagrand's concentration inequality were derived for the setting of sampling without replacement. These inequalities were then applied to achieve transductive risk bounds with fast rates of convergence $o(1/\sqrt{m})$, following a localized approach [1]. In contrast, in this work we consider only the worst-case analysis based on the global complexity measures. An analysis under additional assumptions on the problem at hand, including Mammen-Tsybakov type low noise conditions [4], is an interesting open question and left for future work.

Summary of our results. This paper continues the analysis of empirical processes indexed by arbitrary classes of uniformly bounded functions in the setting of sampling without replacement, initiated by [10]. We introduce a new complexity measure called permutational Rademacher complexity (PRC) and argue that it captures the nature of this setting very well. Due to space limitations we present the analysis of PRC only for the special case when the training and test sets have the same size $m = u$, which is nonetheless sufficiently illustrative^6.

^6 All the results presented in this paper are also available for the general $m \neq u$ case, but we defer them to a future extended version of this paper.

We prove a novel symmetrization inequality (Theorem 2), which shows that the expected PRC and the expected suprema of empirical processes when sampling without replacement are equivalent up to multiplicative constants. Quite remarkably, the new upper and lower bounds (the latter is often called the desymmetrization inequality) both hold without any additive terms when $m = u$, in contrast to the standard i.i.d. setting, where an additive term of order $O(1/\sqrt{m})$ is unavoidable in the lower bound. For TRC even the upper symmetrization inequality [10, Lemma 4] includes an additive term of the order $O(1/\sqrt{m})$ and no desymmetrization inequality is known. This suggests that PRC may be a more suitable complexity measure for transductive learning. We would also like to note that the proof of our new symmetrization inequality is surprisingly simple, compared to the one presented in [10].

Next we compare PRC with other popular complexity measures used in statistical learning theory. In particular, we provide achievable upper and lower bounds, relating PRC to the conditional Rademacher complexity (Theorem 3). These bounds show that the PRC is upper and lower bounded by the conditional Rademacher complexity up to additive terms of orders $o(1/\sqrt{m})$ and $O(1/\sqrt{m})$ respectively, which are achievable (Lemma 1). In addition to this, Theorem 3 also significantly improves bounds on the complexity measure called maximum discrepancy presented in [2, Lemma 3]. We also provide a comparison between expected PRC and TRC (Corollary 1), which shows that their values are close up to small multiplicative constants and additive terms of order $O(1/\sqrt{m})$.

Finally, we apply these results to obtain a new computable data-dependent risk bound for transductive learning based on the PRC (Theorem 5), which holds for any bounded loss functions. We conclude by discussing the advantages of the new risk bound over the previously best known one of [10].

2 Notations

We will use calligraphic symbols to denote sets, with subscripts indicating their cardinalities: $\mathrm{card}(\mathcal{Z}_m) = m$. For any function $f$ we will denote its average value computed on a finite set $\mathcal{S}$ by $\bar f(\mathcal{S})$. In what follows we will consider an arbitrary space $\mathcal{Z}$ (for instance, a space of input-output pairs) and a class $\mathcal{F}$ of functions (for instance, loss functions) mapping $\mathcal{Z}$ to $\mathbb{R}$. Most of the proofs are deferred to the last section for improved readability.

Arguably, one of the most popular complexity measures used in statistical learning theory is the Rademacher complexity ([15, 14, 2]):

Definition 1 (Conditional Rademacher complexity). Fix any subset $\mathcal{Z}_m = \{Z_1, \dots, Z_m\} \subset \mathcal{Z}$. The following random quantity is commonly known as a conditional Rademacher complexity:

$$\hat R_m(\mathcal{F}, \mathcal{Z}_m) = \frac{2}{m}\,\mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \epsilon_i f(Z_i)\right],$$

where $\epsilon = \{\epsilon_i\}_{i=1}^{m}$ are i.i.d. Rademacher signs, taking values $\pm 1$ with probabilities $1/2$. When the set $\mathcal{Z}_m$ is clear from the context we will simply write $\hat R_m(\mathcal{F})$.

As discussed in the introduction, Rademacher complexities play an important role in the analysis of empirical processes and statistical learning theory. However, this measure of complexity was devised mainly for the i.i.d. setting, which is different from our setting of sampling without replacement. The following complexity measure was introduced in [10] to overcome this issue:

Definition 2 (Transductive Rademacher complexity). Fix any set $\mathcal{Z}_N = \{Z_1, \dots, Z_N\} \subset \mathcal{Z}$, positive integers $m, u$ such that $N = m + u$, and $p \in [0, 1/2]$. The following quantity is called Transductive Rademacher complexity (TRC):

$$\hat R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, p) = \left(\frac{1}{m} + \frac{1}{u}\right) \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{N} \sigma_i f(Z_i)\right],$$

where $\sigma = \{\sigma_i\}_{i=1}^{m+u}$ are i.i.d. random variables taking values $\pm 1$ with probabilities $p$ and $0$ with probability $1 - 2p$.

We summarize the importance of these two complexity measures in the analysis of empirical processes when sampling without replacement in the following result:

Theorem 1. Fix an $N$-element subset $\mathcal{Z}_N \subset \mathcal{Z}$ and let $m < N$ elements of $\mathcal{Z}_m$ be sampled uniformly without replacement from $\mathcal{Z}_N$. Also let elements of $\mathcal{X}_m$ be sampled uniformly with replacement from $\mathcal{Z}_N$. Denote $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$ with $u := \mathrm{card}(\mathcal{Z}_u) = N - m$. The following upper bound in terms of the i.i.d. Rademacher complexity was provided in [20]:

$$\mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\right)\right] \le \frac{N}{u}\,\mathbb{E}_{\mathcal{X}_m}\left[\hat R_m(\mathcal{F}, \mathcal{X}_m)\right]. \tag{1}$$

The following bound in terms of TRC was provided in [10]. Assume that functions in $\mathcal{F}$ are uniformly bounded by $B$. Then for $p_0 := \frac{mu}{N^2}$ and $c_0 < 5.05$:

$$\mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\right)\right] \le \hat R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, p_0) + c_0 B\,\frac{N\sqrt{\min(m,u)}}{mu}. \tag{2}$$
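Definitions 1 and 2 can be evaluated exactly for a finite class on a small sample by enumerating all sign vectors. The following is only a minimal sketch under assumptions that are not part of the paper: a function class is represented as a list of tuples with `values[j][i]` $= f_j(Z_i)$, and the function names are illustrative. The enumeration is exponential in the sample size, so it is meant for sanity checks on toy examples only.

```python
from itertools import product


def rademacher_complexity(values):
    """Exact conditional Rademacher complexity of Definition 1.

    values[j][i] is assumed to hold f_j(Z_i) for a finite class {f_j}.
    """
    m = len(values[0])
    total = 0.0
    # Average the supremum over all 2^m equally likely sign vectors.
    for eps in product((-1, 1), repeat=m):
        total += max(sum(e * v for e, v in zip(eps, row)) for row in values)
    return (2.0 / m) * total / 2 ** m


def transductive_rademacher_complexity(values, m, u, p):
    """Exact transductive Rademacher complexity (TRC) of Definition 2.

    sigma_i takes values +1 and -1 with probability p each and 0 with
    probability 1 - 2p; values must describe all N = m + u points.
    """
    n_total = m + u
    assert len(values[0]) == n_total and 0.0 <= p <= 0.5
    total = 0.0
    for sigma in product((-1, 0, 1), repeat=n_total):
        prob = 1.0
        for s in sigma:
            prob *= p if s != 0 else 1.0 - 2.0 * p
        if prob == 0.0:
            continue  # this sigma has zero probability
        total += prob * max(sum(s * v for s, v in zip(sigma, row)) for row in values)
    return (1.0 / m + 1.0 / u) * total


# Toy class of three functions on N = 4 points (hypothetical values).
toy_values = [(1, 0, 1, 0), (0, 1, 1, 0), (0, 0, 0, 0)]
print(rademacher_complexity(toy_values))
print(transductive_rademacher_complexity(toy_values, m=2, u=2, p=0.25))
```

For $p = 1/2$ the variables $\sigma_i$ are Rademacher signs, so with $m = u$ the TRC computed this way equals twice the conditional Rademacher complexity of the full sample.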

While (1) did not explicitly appear in [20], it can be immediately derived using [20, Corollary 8] and i.i.d. symmetrization of [13, Theorem 2.1].

Finally, we introduce our new complexity measure:

Definition 3 (Permutational Rademacher complexity). Let $\mathcal{Z}_m \subset \mathcal{Z}$ be any fixed set of cardinality $m$. For any $n \in \{1, \dots, m-1\}$ the following quantity will be called a permutational Rademacher complexity (PRC):

$$\hat Q_{m,n}(\mathcal{F}, \mathcal{Z}_m) = \mathbb{E}_{\mathcal{Z}_n}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_k) - \bar f(\mathcal{Z}_n)\right)\right],$$

where $\mathcal{Z}_n$ is a random subset of $\mathcal{Z}_m$ containing $n$ elements sampled uniformly without replacement and $\mathcal{Z}_k := \mathcal{Z}_m \setminus \mathcal{Z}_n$. When the set $\mathcal{Z}_m$ is clear from the context we will simply write $\hat Q_{m,n}(\mathcal{F})$.

The name PRC is explained by the fact that if $m$ is even then the definitions of $\hat Q_{m,m/2}(\mathcal{F})$ and $\hat R_m(\mathcal{F})$ are very similar. Indeed, the only difference is that the expectation in the PRC is over the randomly permuted sequence containing an equal number of $-1$ and $+1$, whereas in Rademacher complexity the average is w.r.t. all the possible sequences of signs. The term permutation complexity has already appeared in [16], where it was used to denote a novel complexity measure for model selection. However, this measure was specific to the i.i.d. setting and binary loss. Moreover, the bounds presented in [16] were of the same order as the risk bounds based on the Rademacher complexity with worse constants in the slack term.

3 Symmetrization and Comparison Results

We start with showing a version of the i.i.d. symmetrization inequality (references can be found in [15, 13]) for the setting of sampling without replacement. It shows that the expected supremum of empirical processes in this setting is up to multiplicative constants equivalent to the expected PRC.

Theorem 2. Fix an $N$-element subset $\mathcal{Z}_N \subset \mathcal{Z}$ and let $m < N$ elements of $\mathcal{Z}_m$ be sampled uniformly without replacement from $\mathcal{Z}_N$. Denote $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$ with $u := \mathrm{card}(\mathcal{Z}_u) = N - m$. If $m = u$ and $m$ is even then for any $n \in \{1, \dots, m-1\}$:

$$\frac{1}{2}\,\mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \le \mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\right)\right] \le \mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,n}(\mathcal{F}, \mathcal{Z}_m)\right].$$

The inequalities also hold if we include absolute values inside the suprema.

Proof. The proof can be found in Sect. 5.1.

This inequality should be compared to the previously known complexity bounds of Theorem 1. First of all, in contrast to (1) and (2) the new bound provides a two-sided control, which shows that PRC is a correct complexity measure for our setting. It is also remarkable that the lower bound (commonly known as the desymmetrization inequality) does not include any additive terms, since in the standard i.i.d. setting the lower bound holds only up to an additive term of order $O(1/\sqrt{m})$ [13, Sect. 2.1]. Also note that this result does not assume the boundedness of functions in $\mathcal{F}$, which is a necessary assumption both in (2) and in the i.i.d. desymmetrization inequality.

Next we compare PRC with the conditional Rademacher complexity:

Theorem 3. Let $\mathcal{Z}_m \subset \mathcal{Z}$ be any fixed set of even cardinality $m$. Then:

$$\hat Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m) \le \left(1 + \frac{2}{\sqrt{2\pi m} - 2}\right) \hat R_m(\mathcal{F}, \mathcal{Z}_m). \tag{3}$$

Moreover, if the functions in $\mathcal{F}$ are absolutely bounded by $B$ then

$$\left|\hat Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m) - \hat R_m(\mathcal{F}, \mathcal{Z}_m)\right| \le \frac{2B}{\sqrt{m}}. \tag{4}$$

The results also hold if we include absolute values inside the suprema in $\hat Q_{m,n}$ and $\hat R_m$.

Proof. Conceptually the proof is based on the coupling between a sequence $\{\epsilon_i\}_{i=1}^{m}$ of i.i.d. Rademacher signs and a uniform random permutation $\{\eta_i\}_{i=1}^{m}$ of a set containing $m/2$ plus and $m/2$ minus signs. This idea was inspired by the techniques used in [11]. The detailed proof can be found in Sect. 5.2.

Note that a typical order of $\hat R_m(\mathcal{F})$ is $O(1/\sqrt{m})$, thus the multiplicative upper bound (3) can be much tighter than the upper bound of (4). We would also like to note that Theorem 3 significantly improves bounds of Lemma 3 in [2], which relate the so-called maximal discrepancy measure of the class $\mathcal{F}$ to its Rademacher complexity (for further discussion we refer to the Appendix).

Our next result shows that the bounds of Theorem 3 are essentially tight.

Lemma 1. Let $\mathcal{Z}_m \subset \mathcal{Z}$ with even $m$. There are two finite classes $\mathcal{F}'_m$ and $\mathcal{F}''_m$ of functions mapping $\mathcal{Z}$ to $\mathbb{R}$ and absolutely bounded by $1$, such that:

$$\hat Q_{m,m/2}(\mathcal{F}'_m, \mathcal{Z}_m) = 0, \qquad (2m)^{-1/2} \le \hat R_m(\mathcal{F}'_m, \mathcal{Z}_m) \le m^{-1/2}; \tag{5}$$

and

$$\hat Q_{m,m/2}(\mathcal{F}''_m, \mathcal{Z}_m) = 1, \qquad 1 - \frac{2}{\sqrt{2\pi m}} \le \hat R_m(\mathcal{F}''_m, \mathcal{Z}_m) = 1 - 2^{-m}\binom{m}{m/2}. \tag{6}$$

Proof. The proof can be found in Sect. 5.3.

Inequalities (5) simultaneously show that (a) the order $O(1/\sqrt{m})$ of the additive bound (4) can not be improved, and (b) the multiplicative upper bound (3) can not be reversed. Moreover, it can be shown using (6) that the factor appearing in (3) can not be improved to $1 + o(1/\sqrt{m})$.

Finally, we compare PRC to the transductive Rademacher complexity:

Lemma 2. Fix any set $\mathcal{Z}_N = \{Z_1, \dots, Z_N\} \subset \mathcal{Z}$. If $m = u$ and $N = m + u$:

$$\hat R_N(\mathcal{F}, \mathcal{Z}_N) \le \hat R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4) \le 2\,\hat R_N(\mathcal{F}, \mathcal{Z}_N).$$
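The comparison results of Theorem 3 and the two classes of Lemma 1 can be checked numerically on a tiny sample. The sketch below is an illustration under stated assumptions rather than part of the paper: classes are again represented as lists of value tuples (`values[j][i]` $= f_j(Z_i)$), both complexities are computed by exhaustive enumeration, and the helper names are hypothetical.

```python
from itertools import combinations, product
from math import pi, sqrt


def rademacher(values):
    """Exact conditional Rademacher complexity (Definition 1)."""
    m = len(values[0])
    total = sum(
        max(sum(e * v for e, v in zip(eps, row)) for row in values)
        for eps in product((-1, 1), repeat=m)
    )
    return (2.0 / m) * total / 2 ** m


def prc(values, n):
    """Exact permutational Rademacher complexity Q_{m,n} (Definition 3)."""
    m = len(values[0])
    k = m - n
    subsets = list(combinations(range(m), n))
    total = 0.0
    for idx in subsets:
        chosen = set(idx)  # indices of Z_n; the rest form Z_k
        total += max(
            sum(row[i] for i in range(m) if i not in chosen) / k
            - sum(row[i] for i in chosen) / n
            for row in values
        )
    return total / len(subsets)


m = 4
# Class of two constant functions (first class of Lemma 1).
constants = [(1,) * m, (0,) * m]
# All 0/1 vectors with exactly m/2 ones (second class of Lemma 1).
balanced = [
    tuple(1 if i in ones else 0 for i in range(m))
    for ones in combinations(range(m), m // 2)
]

factor = 1 + 2 / (sqrt(2 * pi * m) - 2)  # multiplicative constant in (3)
for name, cls in (("constants", constants), ("balanced", balanced)):
    q, r = prc(cls, m // 2), rademacher(cls)
    # Check (3) and (4) with B = 1.
    print(name, q, r, q <= factor * r, abs(q - r) <= 2 / sqrt(m))
```

For $m = 4$ the constants class gives $\hat Q = 0$ and $\hat R = 0.375$, while the balanced class gives $\hat Q = 1$ and $\hat R = 0.625 = 1 - 2^{-4}\binom{4}{2}$; in the second case the right-hand side of (3) is about $1.04$, so the multiplicative bound is nearly attained.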

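Proof. The upper bound was presented in [10, Lemma 1]. For the lower bound, notice that if $p = 1/4$ the i.i.d. signs $\sigma_i$ presented in Definition 2 have the same distribution as $\epsilon_i \eta_i$, where $\epsilon_i$ are i.i.d. Rademacher signs and $\eta_i$ are i.i.d. Bernoulli random variables with parameter $1/2$. Thus, Jensen's inequality gives:

$$\hat R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4) = \frac{4}{N}\,\mathbb{E}_{(\epsilon,\eta)}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m+u} \epsilon_i \eta_i f(Z_i)\right] \ge \frac{4}{N}\,\mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m+u} \epsilon_i \frac{1}{2} f(Z_i)\right] = \hat R_N(\mathcal{F}, \mathcal{Z}_N).$$

Together with Theorems 2 and 3 this result shows that when $m = u$ the PRC can not be much larger than the transductive Rademacher complexity:

Corollary 1. Using the notations of Theorem 2, we have:

$$\mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \le \left(2 + \frac{4}{\sqrt{2\pi N} - 2}\right) \hat R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4).$$

If functions in $\mathcal{F}$ are uniformly bounded by $B$ then we also have a lower bound:

$$\mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \ge \frac{1}{2}\,\hat R^{td}_{m+u}(\mathcal{F}, \mathcal{Z}_N, 1/4) - \frac{2B}{\sqrt{N}}.$$

Proof. Simply notice that $\mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\right)\right] = \hat Q_{N,m}(\mathcal{F}, \mathcal{Z}_N)$, and combine Theorem 2, Theorem 3 (applied to $\mathcal{Z}_N$), and Lemma 2.

4 Transductive Risk Bounds

Next we will use the results of Sect. 3 to obtain a new transductive risk bound. First we will shortly describe the setting. We will consider the second, distribution-free setting of transductive learning described in the introduction. Fix any finite general population of input-output pairs $\mathcal{Z}_N = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are arbitrary input and output spaces. We make no assumptions regarding the underlying source of $\mathcal{Z}_N$. The learner receives the labeled training set $\mathcal{Z}_m$ consisting of $m < N$ elements sampled uniformly without replacement from $\mathcal{Z}_N$. The remaining test set $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$ is presented to the learner without labels (we will use $\mathcal{X}_u$ to denote the inputs of $\mathcal{Z}_u$). The goal of the learner is to find a predictor in the fixed hypothesis class $\mathcal{H}$ based on the training sample $\mathcal{Z}_m$ and unlabelled test points $\mathcal{X}_u$, which has a small test risk measured using a bounded loss function $\ell\colon \mathcal{Y} \times \mathcal{Y} \to [0, 1]$. For $h \in \mathcal{H}$ and $(x, y) \in \mathcal{Z}_N$ denote $\ell_h(x, y) = \ell\big(h(x), y\big)$ and also denote the loss class $L_{\mathcal{H}} = \{\ell_h : h \in \mathcal{H}\}$. Then the test and training risks of $h \in \mathcal{H}$ are defined as $\mathrm{err}_u(h) := \bar\ell_h(\mathcal{Z}_u)$ and $\mathrm{err}_m(h) := \bar\ell_h(\mathcal{Z}_m)$ respectively.

The following risk bound in terms of TRC was presented in [10, Corollary 2]:

Theorem 4 ([10]). If $m = u$ then with probability at least $1 - \delta$ over the random training set $\mathcal{Z}_m$ any $h \in \mathcal{H}$ satisfies:

$$\mathrm{err}_u(h) \le \mathrm{err}_m(h) + \hat R^{td}_{m+u}(L_{\mathcal{H}}, \mathcal{Z}_N, 1/4) + 11\sqrt{\frac{2}{N}} + \sqrt{\frac{2N\log(1/\delta)}{(N - 1/2)^2}}. \tag{7}$$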

Using the results of Sect. 3 we obtain the following risk bound:

Theorem 5. If $m = u$ and $n \in \{1, \dots, m-1\}$ then with probability at least $1 - \delta$ over the random training set $\mathcal{Z}_m$ any $h \in \mathcal{H}$ satisfies:

$$\mathrm{err}_u(h) \le \mathrm{err}_m(h) + \mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,n}(L_{\mathcal{H}}, \mathcal{Z}_m)\right] + \sqrt{\frac{2N\log(1/\delta)}{(N - 1/2)^2}}. \tag{8}$$

Moreover, with probability at least $1 - \delta$ any $h \in \mathcal{H}$ satisfies:

$$\mathrm{err}_u(h) \le \mathrm{err}_m(h) + \hat Q_{m,n}(L_{\mathcal{H}}, \mathcal{Z}_m) + 2\sqrt{\frac{2N\log(2/\delta)}{(N - 1/2)^2}}. \tag{9}$$

Proof. The proof can be found in Sect. 5.4.

We conclude by comparing the risk bounds of Theorems 5 and 4:

1. First of all, the upper bound of (9) is computable. This bound is based on the concentration argument, which shows that the expected PRC (appearing in (8)) can be nicely estimated using the training set. Meanwhile, the upper bound of (7) depends on the unknown labels of the test set through TRC. In order to make it computable the authors of [10] resorted to the contraction inequality, which allows one to drop any dependence on the labels for Lipschitz losses, and which is known to be a loose step [17].

2. Moreover, we would like to note that for the binary loss function TRC (as well as the Rademacher complexity) does not depend on the labels at all. Indeed, this can be shown by writing $\ell_{01}(y, y') = (1 - yy')/2$ for $y, y' \in \{-1, +1\}$ and noting that $\sigma_i$ and $\sigma_i y$ are identically distributed for $\sigma_i$ used in Definition 2. This is not true for PRC, which is sensitive to the labels even in this setting. As future work we hope to use this fact for analysis in the low noise setting.

3. The slack term appearing in (8) is significantly smaller than the one of (7). For instance, if $\delta = 0.01$ then the latter is 13 times larger. This is caused by the additive term in the symmetrization inequality (2). At the same time, Corollary 1 shows that the complexity term appearing in (8) is at most two times larger than TRC, appearing in (7).

4. The comparison result of Theorem 3 shows that the upper bound of (9) is also tighter than the one which can be obtained using (1) and the conditional Rademacher complexity.

5. Similar upper bounds (up to an extra factor of 2) also hold for the excess risk $\mathrm{err}_u(h_m) - \inf_{h \in \mathcal{H}} \mathrm{err}_u(h)$, where $h_m$ minimizes the training risk $\mathrm{err}_m$ over $\mathcal{H}$. This can be proved using an argument similar to that of Theorem 5.

6. Finally, one more application of the concentration argument can simplify the computation of PRC, by estimating the expected value appearing in Definition 3 with only one random partition of $\mathcal{Z}_m$.
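The last remark suggests estimating the PRC of a loss class from random partitions of the training set alone. The following is a minimal sketch of such an estimator for a finite hypothesis class, written under assumptions that are not in the paper: the losses on the training points are given as a list of tuples with `losses[j][i]` $= \ell_{h_j}(x_i, y_i)$, and the function names, the seed handling, and the toy numbers are purely illustrative.

```python
import random


def prc_one_partition(losses, n, rng):
    """PRC estimate from a single random split of the training set.

    losses[j][i] is assumed to hold the loss of hypothesis h_j on the
    i-th training point; n is the size of the random subset Z_n.
    """
    m = len(losses[0])
    k = m - n
    chosen = set(rng.sample(range(m), n))  # indices of Z_n, drawn without replacement
    return max(
        sum(row[i] for i in range(m) if i not in chosen) / k
        - sum(row[i] for i in chosen) / n
        for row in losses
    )


def prc_monte_carlo(losses, n, num_partitions, seed=0):
    """Average of single-partition estimates over independent random splits."""
    rng = random.Random(seed)
    draws = [prc_one_partition(losses, n, rng) for _ in range(num_partitions)]
    return sum(draws) / num_partitions


# Hypothetical 0/1 losses of three predictors on m = 6 training points.
toy_losses = [
    (0, 1, 0, 0, 1, 0),
    (1, 1, 0, 1, 0, 0),
    (0, 0, 0, 1, 1, 1),
]
print(prc_one_partition(toy_losses, 3, random.Random(1)))
print(prc_monte_carlo(toy_losses, 3, num_partitions=200))
```

Only training losses enter the computation, which is the sense in which the bound (9) is computable. Using a single partition replaces the expectation in Definition 3 by one draw; averaging over several partitions reduces the variance of the estimate at proportional cost.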

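5 Full Proofs

5.1 Proof of Theorem 2

Lemma 3. For $0 < m \le N$ let $\mathcal{S}_m := \{s_1, \dots, s_m\}$ be sampled uniformly without replacement from a finite set of real numbers $\mathcal{C} = \{c_1, \dots, c_N\} \subset \mathbb{R}$. Then:

$$\mathbb{E}_{\mathcal{S}_m}\left[\frac{1}{m}\sum_{i=1}^{m} s_i\right] = \binom{N}{m}^{-1} \sum_{\mathcal{S} \subseteq \mathcal{C},\ \mathrm{card}(\mathcal{S}) = m} \frac{1}{m} \sum_{z \in \mathcal{S}} z = \frac{1}{m}\binom{N}{m}^{-1}\binom{N-1}{m-1} \sum_{i=1}^{N} c_i = \frac{1}{N}\sum_{i=1}^{N} c_i.$$

Proof (of Theorem 2). Fix any positive integers $n$ and $k$ such that $n + k = m$, which implies $n < m$ and $k < m = u$. Note that Lemma 3 implies:

$$\bar f(\mathcal{Z}_u) = \mathbb{E}_{\mathcal{S}_k}\left[\bar f(\mathcal{S}_k)\right], \qquad \bar f(\mathcal{Z}_m) = \mathbb{E}_{\mathcal{S}_n}\left[\bar f(\mathcal{S}_n)\right],$$

where $\mathcal{S}_k$ and $\mathcal{S}_n$ are sampled uniformly without replacement from $\mathcal{Z}_u$ and $\mathcal{Z}_m$ respectively. Using Jensen's inequality we get:

$$\mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\right)\right] = \mathbb{E}_{\mathcal{Z}_m}\left[\sup_{f \in \mathcal{F}} \left(\mathbb{E}_{\mathcal{S}_k}\left[\bar f(\mathcal{S}_k)\right] - \mathbb{E}_{\mathcal{S}_n}\left[\bar f(\mathcal{S}_n)\right]\right)\right] \le \mathbb{E}_{(\mathcal{Z}_m, \mathcal{S}_k, \mathcal{S}_n)}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{S}_k) - \bar f(\mathcal{S}_n)\right)\right]. \tag{10}$$

The marginal distribution of $(\mathcal{S}_k, \mathcal{S}_n)$, appearing in (10), can be equivalently described by first sampling $\mathcal{Z}'_m$ from $\mathcal{Z}_N$, then $\mathcal{S}_n$ from $\mathcal{Z}'_m$ (both times uniformly without replacement), and setting $\mathcal{S}_k := \mathcal{Z}'_m \setminus \mathcal{S}_n$ (recall that $n + k = m$). Thus

$$\mathbb{E}_{(\mathcal{Z}_m, \mathcal{S}_k, \mathcal{S}_n)}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{S}_k) - \bar f(\mathcal{S}_n)\right)\right] = \mathbb{E}_{\mathcal{Z}'_m}\left[\mathbb{E}_{\mathcal{S}_n}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}'_m \setminus \mathcal{S}_n) - \bar f(\mathcal{S}_n)\right) \,\Big|\, \mathcal{Z}'_m\right]\right] = \mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,n}(\mathcal{F}, \mathcal{Z}_m)\right],$$

which completes the proof of the upper bound.

We have shown that for $n \in \{1, \dots, m-1\}$ and $k := m - n$:

$$\mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,n}(\mathcal{F}, \mathcal{Z}_m)\right] = \mathbb{E}_{(\mathcal{Z}_k, \mathcal{Z}_n)}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_k) - \bar f(\mathcal{Z}_n)\right)\right], \tag{11}$$

where $\mathcal{Z}_n$ and $\mathcal{Z}_k$ are sampled uniformly without replacement from $\mathcal{Z}_N$ and $\mathcal{Z}_N \setminus \mathcal{Z}_n$ respectively. Let $\mathcal{Z}'_n$ be sampled uniformly without replacement from $\mathcal{Z}_N \setminus (\mathcal{Z}_n \cup \mathcal{Z}_k)$ and let $\mathcal{Z}_{u-k}$ be the remaining $u - k$ elements of $\mathcal{Z}_N$. Using Lemma 3 once again we get:

$$\mathbb{E}\left[\bar f(\mathcal{Z}'_n) \,\big|\, (\mathcal{Z}_n, \mathcal{Z}_k)\right] = \mathbb{E}\left[\bar f(\mathcal{Z}_{u-k}) \,\big|\, (\mathcal{Z}_n, \mathcal{Z}_k)\right].$$

We can rewrite the r.h.s. of (11) as:

$$\mathbb{E}_{(\mathcal{Z}_n, \mathcal{Z}_k)}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_k) - \bar f(\mathcal{Z}_n) + \mathbb{E}\left[\bar f(\mathcal{Z}_{u-k}) - \bar f(\mathcal{Z}'_n) \,\big|\, (\mathcal{Z}_n, \mathcal{Z}_k)\right]\right)\right] \le \mathbb{E}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_k) - \bar f(\mathcal{Z}_n) + \bar f(\mathcal{Z}_{u-k}) - \bar f(\mathcal{Z}'_n)\right)\right],$$

where we have used Jensen's inequality. If we take $n = k = m/2$ we get

$$\mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,m/2}(\mathcal{F}, \mathcal{Z}_m)\right] \le 2\,\mathbb{E}\left[\sup_{f \in \mathcal{F}} \left(\bar f(\mathcal{Z}_k \cup \mathcal{Z}_{u-k}) - \bar f(\mathcal{Z}_n \cup \mathcal{Z}'_n)\right)\right].$$

It is left to notice that the random subsets $\mathcal{Z}_k \cup \mathcal{Z}_{u-k}$ and $\mathcal{Z}_n \cup \mathcal{Z}'_n$ have the same distributions as $\mathcal{Z}_u$ and $\mathcal{Z}_m$.

5.2 Proof of Theorem 3

Let $m = 2n$, $\epsilon = \{\epsilon_i\}_{i=1}^{m}$ be i.i.d. Rademacher signs, and $\eta = \{\eta_i\}_{i=1}^{m}$ be a uniform random permutation of a set containing $n$ plus and $n$ minus signs. The proof of Theorem 3 is based on the coupling of random variables $\epsilon$ and $\eta$, which is described in Lemma 4.

We will need a number of definitions. Consider the binary cube $\mathbb{B}_m := \{-1, +1\}^m$. Denote $\mathbb{S}_m := \{v \in \mathbb{B}_m : \sum_{i=1}^{m} v_i = 0\}$, which is the set of all the vectors in $\mathbb{B}_m$ having an equal number of plus and minus signs. For any $v \in \mathbb{B}_m$ denote $\|v\|_1 = \sum_{i=1}^{m} |v_i|$ and consider the following set:

$$T(v) = \arg\min_{v' \in \mathbb{S}_m} \|v - v'\|_1,$$

which consists of the points in $\mathbb{S}_m$ closest to $v$ in the Hamming metric. For any $v \in \mathbb{B}_m$ let $t(v)$ be a random element of $T(v)$, distributed uniformly. We will use $t_i(v)$ to denote the $i$-th coordinate of the vector $t(v)$.

Remark 1. If $v \in \mathbb{S}_m$ then $T(v) = \{v\}$. Otherwise, $T(v)$ will clearly contain more than one element of $\mathbb{S}_m$. Namely, it can be shown that if for some positive integer $q$ it holds that $\sum_{i=1}^{m} v_i = q$, then $q$ is necessarily even and $T(v)$ consists of all the vectors in $\mathbb{S}_m$ which can be obtained by replacing $q/2$ of the $+1$ signs in $v$ with $-1$ signs, and thus in this case $\mathrm{card}\big(T(v)\big) = \binom{(m+q)/2}{q/2}$.

Lemma 4 (Coupling). Assume that $m = 2n$. Then the random sequence $t(\epsilon)$ has the same distribution as $\eta$.

Proof. Note that the support of $t(\epsilon)$ is equal to $\mathbb{S}_m$. From symmetry it is easy to conclude that the distribution of $t(\epsilon)$ is exchangeable. This means that it is invariant under permutations and as a consequence uniform on $\mathbb{S}_m$.

The next result is at the core of the multiplicative upper bound (3).

Lemma 5. Assume that $m = 2n$. For any $q \in \{1, \dots, m\}$ the following holds:

$$\mathbb{E}\left[\epsilon_q \,\big|\, t(\epsilon)\right] = \left(1 - 2^{-m}\binom{m}{n}\right) t_q(\epsilon), \qquad \text{where} \quad 2^{-m}\binom{m}{n} \le 2\,(2\pi m)^{-1/2}.$$

Proof. We will first upper bound $\mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \mid t(\epsilon) = e\}$, where $e = \{e_i\}_{i=1}^{m}$ is (w.l.o.g.) a sequence of $n$ plus signs followed by a sequence of $n$ minus signs.

$$\mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \mid t(\epsilon) = e\} = \frac{\mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \cap t(\epsilon) = e\}}{\mathbb{P}\{t(\epsilon) = e\}} = \binom{m}{n}\, 2^{-m} \sum_{s} \mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\}, \tag{12}$$

where we have used Lemma 4 and the sum is over all $2^m$ different sequences of signs $s = \{s_i\}_{i=1}^{m}$. For any $s$ denote $S(s) = \sum_{j=1}^{2n} s_j$ and consider the terms in (12) corresponding to $s$ with $S(s) = 0$, $S(s) > 0$, and $S(s) < 0$:

Case 1: $S(s) = 0$. These terms will be zero, since $t(s) = s$.

Case 2: $S(s) > 0$. This means that $s$ has more plus signs than it should and according to Remark 1 the mapping $t(\cdot)$ will replace several of the $+1$ with $-1$. In particular, if $s_q = -1$ then $t_q(s) = s_q$ and thus the corresponding terms will be zero. If $s_q = 1$ and at the same time $e_q = 1$ the event $\{\epsilon_q \ne t_q(\epsilon) \cap t(\epsilon) = e\}$ also can not hold. Moreover, note that the identity $e = t(s)$ can hold only if $e \in T(s)$, which necessarily leads to

$$\{j \in \{1, \dots, m\} : s_j = -1\} \subseteq \{j \in \{1, \dots, m\} : e_j = -1\}. \tag{13}$$

From this we conclude that if $q \in \{1, \dots, n\}$ then all the terms corresponding to $s$ with $S(s) > 0$ are zero. We will use $U_q(e)$ to denote the subset of $\mathbb{B}_m$ consisting of sequences $s$ such that (a) $S(s) > 0$, (b) $s_q = 1$, and (c) condition (13) holds. It can be seen that if $s \in U_q(e)$ then:

$$\mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\} = \binom{n + S(s)/2}{S(s)/2}^{-1}.$$

This holds since, according to Remark 1, $t(\epsilon)$ can take exactly $\binom{n + S(s)/2}{S(s)/2}$ different values, while only one of them is equal to $e$. Let us compute the cardinality of $U_q(e)$ for $q \in \{n+1, \dots, m\}$. It is easy to check that the condition $S(s) = 2j$ for some positive integer $j$ implies that $s$ has exactly $n - j$ minus signs. Considering the fact that $s_q = 1$ for $s \in U_q(e)$, the number of sequences $s \in U_q(e)$ with $S(s) = 2j$ is

$$\mathrm{card}\big(\{s \in U_q(e) : S(s) = 2j\}\big) = \binom{n-1}{n-j}.$$

Combining everything together we have:

$$\sum_{s\colon S(s) > 0} \mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\} = \mathbb{1}\{q > n\} \sum_{j=1}^{n} \binom{n-1}{n-j} \binom{n+j}{j}^{-1}.$$

Finally, it is easy to show using induction that:

$$\sum_{j=1}^{n} \binom{n-1}{n-j} \binom{n+j}{j}^{-1} = \frac{1}{2}.$$

Case 3: $S(s) < 0$. We can repeat all the steps of the previous case and get:

$$\sum_{s\colon S(s) < 0} \mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \cap t(\epsilon) = e \mid \epsilon = s\} = \frac{1}{2}\,\mathbb{1}\{q \le n\}.$$

Accounting for these three cases in (12) we conclude that

$$\mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \mid t(\epsilon) = e\} = \frac{1}{2}\binom{m}{n}\, 2^{-m} \le \frac{1}{\sqrt{2\pi m}},$$

where we have used the upper bound on the binomial coefficient from [19, Corollary 2.4]. We can conclude the proof of the lemma by writing:

$$\mathbb{E}\left[\epsilon_q \,\big|\, t(\epsilon)\right] = t_q(\epsilon)\left(1 - 2\,\mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \mid t(\epsilon)\}\right), \qquad 1 - 2\,\mathbb{P}\{\epsilon_q \ne t_q(\epsilon) \mid t(\epsilon)\} \ge 1 - 2\,(2\pi m)^{-1/2}.$$

Proof (of Theorem 3). First we prove (3). Let $\mathcal{Z}_m = \{z_1, \dots, z_m\}$. We can write:

$$\hat Q_{m,n}(\mathcal{F}) = \frac{2}{m}\,\mathbb{E}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} t_i(\epsilon) f(z_i)\right] \tag{14}$$

$$\le \left(1 - 2\,(2\pi m)^{-1/2}\right)^{-1} \frac{2}{m}\,\mathbb{E}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \mathbb{E}\left[\epsilon_i \,\big|\, t(\epsilon)\right] f(z_i)\right] \tag{15}$$

$$\le \left(1 + \frac{2}{\sqrt{2\pi m} - 2}\right) \frac{2}{m}\,\mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \epsilon_i f(z_i)\right], \tag{16}$$

where we have used the coupling Lemma 4 in (14), Lemma 5 in (15), and Jensen's inequality in (16). This completes the proof of (3).

Next we prove (4). We have:

$$\hat Q_{m,n}(\mathcal{F}) - \hat R_m(\mathcal{F}) = \frac{2}{m}\,\mathbb{E}_{\eta}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \eta_i f(z_i)\right] - \frac{2}{m}\,\mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \epsilon_i f(z_i)\right].$$

Using Lemma 4 and Jensen's inequality we further get:

$$\left|\hat Q_{m,n}(\mathcal{F}) - \hat R_m(\mathcal{F})\right| = \frac{2}{m}\left|\mathbb{E}_{\epsilon}\mathbb{E}_{t \mid \epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} t_i(\epsilon) f(z_i)\right] - \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \epsilon_i f(z_i)\right]\right| \le \frac{2}{m}\,\mathbb{E}_{\epsilon}\mathbb{E}_{t \mid \epsilon}\left|\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} t_i(\epsilon) f(z_i) - \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \epsilon_i f(z_i)\right|, \tag{17}$$

where we have, perhaps misleadingly, denoted the conditional expectation with respect to the uniform choice from $T(\epsilon)$ given $\epsilon$ using $\mathbb{E}_{t \mid \epsilon}$. Next we have:

$$\frac{2}{m}\left|\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} t_i(\epsilon) f(z_i) - \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \epsilon_i f(z_i)\right| \le \frac{4}{m}\sup_{f \in \mathcal{F}} \left|\sum_{i \in S(\epsilon, t)} \epsilon_i f(z_i)\right|, \tag{18}$$

where $S(\epsilon, t) \subseteq \{1, \dots, m\}$ is the subset of indices such that $t_i(\epsilon) \ne \epsilon_i$ iff $i \in S(\epsilon, t)$. We can continue by writing

$$\frac{2}{m}\left|\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} t_i(\epsilon) f(z_i) - \sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \epsilon_i f(z_i)\right| \le \frac{4}{m}\sup_{f \in \mathcal{F}} \sum_{i \in S(\epsilon, t)} |f(z_i)|. \tag{19}$$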

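Note that since functions in $\mathcal{F}$ are absolutely bounded by $B$:

$$\sup_{f \in \mathcal{F}} \sum_{i \in S(\epsilon, t)} |f(z_i)| \le B \cdot \mathrm{card}\big(S(\epsilon, t)\big).$$

Returning to (17) and using Remark 1 we obtain:

$$\left|\hat Q_{m,n}(\mathcal{F}) - \hat R_m(\mathcal{F})\right| \le \frac{4B}{m}\,\mathbb{E}_{\epsilon}\mathbb{E}_{t \mid \epsilon}\left[\mathrm{card}\big(S(\epsilon, t)\big)\right] = \frac{2B}{m}\,\mathbb{E}_{\epsilon}\left|\sum_{i=1}^{m} \epsilon_i\right|.$$

Khinchin's inequality [15, Lemma 4.1] together with the best known constant due to [12] gives $\mathbb{E}_{\epsilon}\left|\sum_{i=1}^{m} \epsilon_i\right| \le \sqrt{m}$, which completes the proof of (4).

5.3 Proof of Lemma 1

Proof. Let $\mathcal{Z}_m = \{z_1, \dots, z_m\}$. Take $\mathcal{F}'_m$ to be a set of two constant functions, $f_1(z) = 1$ and $f_2(z) = 0$ for all $z \in \mathcal{Z}$. Clearly, $\hat Q_{m,n}(\mathcal{F}'_m) = 0$. At the same time:

$$\hat R_m(\mathcal{F}'_m) = \frac{2}{m}\,\mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}'_m} \sum_{i=1}^{m} \epsilon_i f(z_i)\right] = \frac{2}{m}\,\mathbb{E}_{\epsilon}\left[\max\left\{0, \sum_{i=1}^{m} \epsilon_i\right\}\right] = \frac{1}{m}\,\mathbb{E}_{\epsilon}\left|\sum_{i=1}^{m} \epsilon_i\right| \le \frac{1}{\sqrt{m}},$$

where we used Khinchin's inequality. Finally, Khinchin's inequality also gives:

$$\frac{2}{m}\,\mathbb{E}_{\epsilon}\left[\max\left\{0, \sum_{i=1}^{m} \epsilon_i\right\}\right] = \frac{1}{m}\,\mathbb{E}_{\epsilon}\left|\sum_{i=1}^{m} \epsilon_i\right| \ge \frac{1}{\sqrt{2m}}.$$

Next, let $\mathcal{F}''_m$ contain $\binom{m}{m/2}$ functions, such that their projections on $\mathcal{Z}_m$ recover all the permutations of a binary vector containing an equal number of $0$ and $1$. Clearly, in this case $\hat Q_{m,n}(\mathcal{F}''_m) = 1$. Straightforward calculations show that at the same time $\hat R_m(\mathcal{F}''_m) = 1 - 2^{-m}\binom{m}{n}$, and we conclude the proof using upper and lower bounds on the binomial coefficient from [19, Corollary 2.4].

5.4 Proof of Theorem 5

The following version of McDiarmid's bounded difference inequality for the setting of sampling without replacement was presented in [10, Lemma 2] and further improved in [8, Theorem 5]:

Theorem 6 ([10, 8]). Let $\mathcal{Z}_m$ be sampled uniformly without replacement from a fixed set $\mathcal{Z}_{m+u} \subset \mathcal{Z}$ of $m + u$ elements. Let $g\colon \mathcal{Z}^m \to \mathbb{R}$ be a symmetric function such that for all $i = 1, \dots, m$ and for all $z_1, \dots, z_m \in \mathcal{Z}$ and $z'_1, \dots, z'_m \in \mathcal{Z}$,

$$\left|g(z_1, \dots, z_m) - g(z_1, \dots, z_{i-1}, z'_i, z_{i+1}, \dots, z_m)\right| \le c. \tag{20}$$

Then if $m = u$ with probability not less than $1 - \delta$ the following holds:

$$g \le \mathbb{E}[g] + \sqrt{\frac{c^2 N^3 \log(1/\delta)}{8\,(N - 1/2)^2}}.$$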
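As a worked illustration of how the deviation term of Theorem 6 turns into the slack terms of Theorem 5, one can substitute a bounded difference constant of the form $c = 4/N$, which is the value $2/u = 2/m$ when $m = u = N/2$. The numerical values below are illustrative assumptions only ($N = 1000$, $\delta = 0.01$, natural logarithm).

```latex
% Substituting c = 4/N into the deviation term of Theorem 6:
\sqrt{\frac{c^2 N^3 \log(1/\delta)}{8\,(N - 1/2)^2}}
  = \sqrt{\frac{16}{N^2} \cdot \frac{N^3 \log(1/\delta)}{8\,(N - 1/2)^2}}
  = \sqrt{\frac{2N \log(1/\delta)}{(N - 1/2)^2}} .

% Illustrative values (assumed): N = 1000, \delta = 0.01, natural logarithm.
\sqrt{\frac{2 \cdot 1000 \cdot \log(100)}{(999.5)^2}}
  \approx \frac{\sqrt{9210.3}}{999.5}
  \approx 0.096 .
```

The term therefore decays at the rate $\sqrt{2\log(1/\delta)/N}$ up to a factor $N/(N - 1/2)$ that is close to one.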

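Note that the function $\sup_{h \in \mathcal{H}} \big(\mathrm{err}_u(h) - \mathrm{err}_m(h)\big)$ maps $(\mathcal{X} \times \mathcal{Y})^m$ to $\mathbb{R}$ and is of course symmetric. Straightforward calculations show that this function satisfies the bounded difference condition (20) with $c = 2/u$ ([10, Inequality 9]). Theorem 6 states that with probability not less than $1 - \delta$:

$$\sup_{h \in \mathcal{H}} \big(\mathrm{err}_u(h) - \mathrm{err}_m(h)\big) \le \mathbb{E}_{\mathcal{Z}_m}\left[\sup_{h \in \mathcal{H}} \big(\mathrm{err}_u(h) - \mathrm{err}_m(h)\big)\right] + \sqrt{\frac{2N\log(1/\delta)}{(N - 1/2)^2}}. \tag{21}$$

Using the upper bound of Theorem 2 with $L_{\mathcal{H}}$ in place of $\mathcal{F}$ we complete the proof of (8).

Next, consider the symmetric function $\hat Q_{m,n}(L_{\mathcal{H}}, \mathcal{Z}_m)$, which also maps $(\mathcal{X} \times \mathcal{Y})^m$ to $\mathbb{R}$. It can be shown again that it satisfies the bounded difference condition (20) with $c = 2/m$. And thus, Theorem 6 gives that with probability not less than $1 - \delta$:

$$\mathbb{E}_{\mathcal{Z}_m}\left[\hat Q_{m,n}(L_{\mathcal{H}}, \mathcal{Z}_m)\right] \le \hat Q_{m,n}(L_{\mathcal{H}}, \mathcal{Z}_m) + \sqrt{\frac{2N\log(1/\delta)}{(N - 1/2)^2}}. \tag{22}$$

Using this inequality together with (8) in a union bound we obtain the second inequality of the theorem.

Appendix: Improving Lemma 3 of [2]

Let $\mu$ be a probability distribution on $\mathcal{Z}$ and $\mathcal{X}_m := \{X_1, \dots, X_m\}$ be i.i.d. samples selected according to $\mu$. The maximal discrepancy of $\mathcal{F}$ was defined in [2] as:

$$\hat D_m(\mathcal{F}, \mathcal{X}_m) = \sup_{f \in \mathcal{F}} \frac{2}{m}\left(\sum_{i=1}^{m/2} f(X_i) - \sum_{i=m/2+1}^{m} f(X_i)\right).$$

It was shown in [2] that if functions in $\mathcal{F}$ are uniformly bounded by $1$ then:

$$\frac{1}{2}\,\mathbb{E}\left[\hat R_m(\mathcal{F}, \mathcal{X}_m)\right] - 2\sqrt{\frac{2}{m}} \le \mathbb{E}\left[\hat D_m(\mathcal{F}, \mathcal{X}_m)\right] \le \mathbb{E}\left[\hat R_m(\mathcal{F}, \mathcal{X}_m)\right] + 4\sqrt{\frac{2}{m}}. \tag{23}$$

Since the elements in $\mathcal{X}_m$ are i.i.d., the distribution of $\hat D_m$ is invariant under their permutations and thus $\mathbb{E}\big[\hat D_m(\mathcal{F}, \mathcal{X}_m)\big] = \mathbb{E}\big[\hat Q_{m,m/2}(\mathcal{F}, \mathcal{X}_m)\big]$. Now we can use Theorem 3 to significantly improve the bounds in (23):

$$\mathbb{E}\left[\hat R_m(\mathcal{F}, \mathcal{X}_m)\right] - \frac{2}{\sqrt{m}} \le \mathbb{E}\left[\hat D_m(\mathcal{F}, \mathcal{X}_m)\right] \le \left(1 + \frac{2}{\sqrt{2\pi m} - 2}\right) \mathbb{E}\left[\hat R_m(\mathcal{F}, \mathcal{X}_m)\right].$$

Acknowledgments

The authors are thankful to Marius Kloft and Ruth Urner for useful discussions and to the anonymous reviewers for their comments. GB acknowledges support of the DFG through the FOR-1735 grant. NZ was supported solely by the Russian Science Foundation grant (project ).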

References

1. Bartlett, P., Bousquet, O., Mendelson, S.: Local Rademacher complexities. The Annals of Statistics, 33(4) (2005)
2. Bartlett, P., Mendelson, S.: Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3 (2001)
3. Blum, A., Langford, J.: PAC-MDL Bounds. In: COLT 2003 (2003)
4. Boucheron, S., Lugosi, G., Bousquet, O.: Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9 (2005)
5. Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press (2013)
6. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press (2006)
7. Cortes, C., Mohri, M.: On transductive regression. In: NIPS 2006 (2007)
8. Cortes, C., Mohri, M., Pechyony, D., Rastogi, A.: Stability analysis and learning bounds for transductive regression algorithms. CoRR abs/ (2009)
9. Derbeko, P., El-Yaniv, R., Meir, R.: Explicit learning curves for transduction and application to clustering and compression algorithms. Journal of Artificial Intelligence Research, 22(1) (2004)
10. El-Yaniv, R., Pechyony, D.: Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35(1) (2009)
11. Gross, D., Nesme, V.: Note on sampling without replacing from a finite collection of matrices. (2010)
12. Haagerup, U.: The best constants in the Khinchine inequality. Studia Mathematica, 70(3) (1981)
13. Koltchinskii, V.: Oracle inequalities in empirical risk minimization and sparse recovery problems. Springer (2011)
14. Koltchinskii, V., Panchenko, D.: Rademacher processes and bounding the risk of function learning. In: Giné, E., Mason, D., Wellner, J. (eds.) High Dimensional Probability, II. Birkhäuser (1999)
15. Ledoux, M., Talagrand, M.: Probability in Banach Space. Springer-Verlag (1991)
16. Magdon-Ismail, M.: Permutation complexity bound on out-sample error. In: Advances in Neural Information Processing Systems (NIPS 2010) (2010)
17. Mendelson, S.: Learning without Concentration. CoRR abs/ (2014)
18. Pechyony, D.: Theory and Practice of Transductive Learning. PhD thesis (2008)
19. Stanica, P.: Good lower and upper bounds on binomial coefficients. Journal of Inequalities in Pure and Applied Mathematics, 2(3) (2001)
20. Tolstikhin, I., Blanchard, G., Kloft, M.: Localized complexities for transductive learning. In: COLT 2014 (2014)
21. Van der Vaart, A. W., Wellner, J.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer (2000)
22. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons (1998)
