E0 370 Statistical Learning Theory                        Lecture 6 (Aug 30, 20)

Margin Analysis

Lecturer: Shivani Agarwal                                 Scribe: Narasimhan R

1 Introduction

In the last few lectures we have seen how to obtain high-confidence bounds on the generalization error of functions learned from function classes of limited capacity, measured in terms of the growth function and VC-dimension for binary-valued function classes in the case of binary classification, and in terms of the covering numbers, pseudo-dimension, and fat-shattering dimension for real-valued function classes in the case of regression. Table 1 summarizes the nature of the results we have obtained.

In this lecture, we return to binary classification, but focus on function classes H of the form H = sign(F) for some real-valued function class F ⊆ ℝ^X. If an algorithm learns a function h_S : X → {−1, +1} of the form h_S(x) = sign(f_S(x)) for some f_S ∈ F, can we say more about its generalization error than for a general binary-valued function h_S : X → {−1, +1}? Note that the SVM algorithm falls in this category, while the basic formulations of decision tree and nearest neighbour classification algorithms do not.¹

Let us first see what we can say about the generalization error of h_S = sign(f_S) using the techniques we have learned so far.

2 Bounding the Generalization Error of sign(f_S) via VCdim(sign(F))

One approach to bounding the generalization error of h_S = sign(f_S) is to consider the VC-dimension of the binary-valued function class sign(F). If VCdim(sign(F)) is finite, then this gives that for any 0 < δ ≤ 1, with probability at least 1 − δ (over the draw of S ∼ D^m),

    er_D[h_S] ≤ er_S[h_S] + c √( (VCdim(sign(F)) ln m + ln(1/δ)) / m ),    (1)

where c > 0 is some fixed constant.

However this approach is limited. Consider for example the two classifiers shown in Figure 1. Both are selected from the class of linear classifiers in ℝ²: H = sign(F), where F = {f : ℝ² → ℝ | f(x) = w·x + b for some w ∈ ℝ², b ∈ ℝ}. We saw earlier that the VC-dimension of this class is 3. The empirical error of h₁ on the training sample is er_S[h₁] = 1/m; for h₂, it is er_S[h₂] = 0.
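To make the gap concrete, here is a minimal sketch (in Python) that evaluates the right-hand side of bound (1) for the two classifiers; the constant c = 1 and the values m = 1000 and δ = 0.05 are illustrative assumptions, not from the lecture.

```python
import math

def vc_bound(emp_err, vcdim, m, delta, c=1.0):
    """Right-hand side of (1): er_S[h] + c * sqrt((VCdim * ln m + ln(1/delta)) / m)."""
    return emp_err + c * math.sqrt((vcdim * math.log(m) + math.log(1 / delta)) / m)

# Linear classifiers in R^2 have VC-dimension 3.  h1 makes one mistake on the
# m training examples (er_S = 1/m); h2 makes none (er_S = 0).
m, delta = 1000, 0.05
b1 = vc_bound(1 / m, vcdim=3, m=m, delta=delta)
b2 = vc_bound(0.0, vcdim=3, m=m, delta=delta)
print(b1, b2)  # the two bounds differ only by the empirical term, 1/m = 0.001
```

The complexity term is identical for both classifiers, so the bound prefers h₂ by exactly 1/m and ignores the margin information entirely.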
So the above result would give a better bound on the generalization error of h₂ than on that of h₁, even though many of us would intuitively expect h₁ to perform better on new examples than h₂.

Figure 1: Two linear classifiers.

3 Bounding the Generalization Error of sign(f_S) via Bounds for f_S

As noted above, bounding the generalization error of h_S = sign(f_S) using only the binary values output by sign(f_S) has its limitations. Therefore we would like to take into account the real-valued predictions made by f_S.

¹There are variants of decision tree and nearest neighbour algorithms that take into account class probabilities or distance-based weights and could be described in the above form, but the basic formulations are not naturally described in this way.

Table 1: Summary of generalization error bounds we have seen so far.

Learning problem | Function class | Loss function | Capacity measures | Bound on generalization error
Binary classification | H ⊆ {−1,+1}^X | ℓ_0-1 | Π_H(m), VCdim(H) | w.p. ≥ 1 − δ (over S ∼ D^m): er_D[h_S] ≤ er_S[h_S] + c √((VCdim(H) ln m + ln(1/δ))/m)
Regression | F ⊆ ℝ^X | ℓ : 0 ≤ ℓ ≤ B, L-Lipschitz | N₁(ε, F, m), Pdim(F), fat_F(ε) | w.p. ≥ 1 − δ (over S ∼ D^m): er^ℓ_D[f_S] ≤ er^ℓ_S[f_S] + ε*(fat_F, m, δ, B, L)
Binary classification using real-valued functions | H = sign(F), F ⊆ ℝ^X | ℓ_0-1 | ? | ?

(Here ε*(fat_F, m, δ, B, L) is the smallest value of ε for which the bound obtained in the last lecture is smaller than δ; it has no closed-form expression in general.)

To this end, and to simplify later statements, let us first extend the 0-1 loss to real-valued predictions as follows: define ℓ_0-1 : {−1,+1} × ℝ → {0,1} as

    ℓ_0-1(y, ŷ) = 1(sign(ŷ) ≠ y).    (2)

This gives for example er_D[f_S] = P_{(x,y)∼D}(sign(f_S(x)) ≠ y) (= er_D[h_S] for h_S = sign(f_S)). We can then consider obtaining bounds on the generalization error er_D[f_S] using results derived in the last lecture in terms of the covering numbers or fat-shattering dimension of the real-valued function class F.

Unfortunately, we cannot use these results to bound er_D[f_S] directly in terms of er_S[f_S] using covering numbers of F, since ℓ_0-1 is not Lipschitz.² However, for any bounded, Lipschitz loss function ℓ ≥ ℓ_0-1, we have er_D[f_S] ≤ er^ℓ_D[f_S], which in turn can be bounded in terms of er^ℓ_S[f_S] using covering numbers of F. Specifically, let F ⊆ Ŷ^X for some Ŷ ⊆ ℝ, and consider any loss function ℓ : {−1,+1} × Ŷ → [0,∞) for which there are constants B, L > 0 such that for all y ∈ {−1,+1} and all ŷ, ŷ₁, ŷ₂ ∈ Ŷ:

(i) 0 ≤ ℓ(y, ŷ) ≤ B;
(ii) |ℓ(y, ŷ₁) − ℓ(y, ŷ₂)| ≤ L |ŷ₁ − ŷ₂|; and
(iii) ℓ_0-1(y, ŷ) ≤ ℓ(y, ŷ).

Then with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^ℓ_D[f_S] ≤ er^ℓ_S[f_S] + ε*(fat_F, m, δ, B, L),    (3)

where ε*(fat_F, m, δ, B, L) is as described in Table 1.

Below we shall give several concrete examples of loss functions satisfying the above conditions.
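As a quick illustration of the extended 0-1 loss (2), the following sketch (Python; the sample and the predictor f(x) = x − 0.5 are toy choices) checks that the empirical 0-1 error of a real-valued f is just the classification error of sign(f):

```python
def loss_01(y, yhat):
    """Extended 0-1 loss (2): 1 if sign(yhat) != y, else 0 (taking sign(0) = -1)."""
    return int((1 if yhat > 0 else -1) != y)

# Toy sample S of (x, y) pairs and a toy real-valued predictor f(x) = x - 0.5.
S = [(0.9, 1), (0.2, 1), (0.7, -1), (0.1, -1)]
f = lambda x: x - 0.5
er_S = sum(loss_01(y, f(x)) for x, y in S) / len(S)
print(er_S)  # f misclassifies (0.2, +1) and (0.7, -1): er_S = 0.5
```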
Before we do so, it will be useful to introduce an alternative description of loss functions ℓ : {−1,+1} × Ŷ → [0,∞) that satisfy the following symmetry condition (assuming Ŷ ⊆ ℝ is closed under negation):

    ℓ(1, ŷ) = ℓ(−1, −ŷ)  ∀ŷ ∈ Ŷ.    (4)

²We could bound er_D[f_S] in terms of er_S[f_S] using covering numbers of the loss function class (ℓ_0-1)_F, but this simply takes us back to the growth function/VC-dimension of sign(F).
For any such loss function ℓ, we can write

    ℓ(y, ŷ) = ℓ(1, yŷ)  ∀y ∈ {−1,+1}, ŷ ∈ Ŷ,    (5)

and therefore the loss function can equivalently be described using a single-variable function φ : Ŷ → [0,∞) given by φ(u) = ℓ(1, u):

    ℓ(y, ŷ) = φ(yŷ)  ∀y ∈ {−1,+1}, ŷ ∈ Ŷ.    (6)

Such loss functions are often called margin-based loss functions, for reasons that will become clear later. As an example, the 0-1 loss above can be viewed as a margin-based loss with φ(u) = 1(u ≤ 0). On the other hand, an asymmetric misclassification loss ℓ : {−1,+1} × {−1,+1} → {0, a, b} with a ≠ b, ℓ(y, y) = 0, ℓ(1, −1) = a, and ℓ(−1, 1) = b, cannot be written as a margin-based loss.

Margin-based losses are easily visualized through a plot of the function φ(·); they also satisfy the following properties, which are easily verified:

(M1) 0 ≤ ℓ ≤ B ⟺ 0 ≤ φ ≤ B
(M2) ℓ is L-Lipschitz in its 2nd argument ⟺ φ is L-Lipschitz
(M3) ℓ ≥ ℓ_0-1 ⟺ φ(u) ≥ 1(u ≤ 0) ∀u ∈ Ŷ
(M4) ℓ is convex in its 2nd argument ⟺ φ is convex.

We now give several examples of (margin-based) losses satisfying conditions (i)–(iii) above, which can be used to derive bounds on the 0-1 generalization error er_D[f_S] in terms of the covering numbers/fat-shattering dimension of F.

Example 1 (Squared loss). Let Ŷ = [−1, 1], and

    ℓ_sq(y, ŷ) = (y − ŷ)² = (1 − yŷ)²  ∀y ∈ {−1,+1}, ŷ ∈ Ŷ.

Then clearly ℓ_sq satisfies conditions (i)–(iii) above with B = 4 and L = 4 (over Ŷ as above; see Figure 2). Therefore we can bound the generalization error er_D[f_S] as follows: with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^sq_S[f_S] + ε*(fat_F, m, δ, 4, 4).    (7)

Example 2 (Absolute loss). Let Ŷ = [−1, 1], and

    ℓ_abs(y, ŷ) = |y − ŷ| = 1 − yŷ  ∀y ∈ {−1,+1}, ŷ ∈ Ŷ.

Then clearly ℓ_abs satisfies properties (i)–(iii) above with B = 2 and L = 1 (see Figure 2). Therefore we have with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^abs_S[f_S] + ε*(fat_F, m, δ, 2, 1).    (8)

Example 3 (Hinge loss). Let Ŷ = [−1, 1], 0 < γ ≤ 1, and

    ℓ_hinge(γ)(y, ŷ) = (1/γ)(γ − yŷ)₊  ∀y ∈ {−1,+1}, ŷ ∈ Ŷ.

(Recall z₊ = max(0, z).) Clearly, ℓ_hinge(γ) satisfies properties (i)–(iii) above with B = (γ + 1)/γ and L = 1/γ (see Figure 2).
Therefore we have with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^hinge(γ)_S[f_S] + ε*(fat_F, m, δ, (γ + 1)/γ, 1/γ).    (9)

Example 4 (Ramp loss). Let Ŷ = [−1, 1], 0 < γ ≤ 1, and

    ℓ_ramp(γ)(y, ŷ) = 1 if yŷ ≤ 0;  ℓ_hinge(γ)(y, ŷ) if yŷ > 0    (∀y ∈ {−1,+1}, ŷ ∈ Ŷ).

Clearly, ℓ_ramp(γ) satisfies properties (i)–(iii) above with B = 1 and L = 1/γ (see Figure 2). Therefore we have with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^ramp(γ)_S[f_S] + ε*(fat_F, m, δ, 1, 1/γ).    (10)

Note that this bound is uniformly better than the one obtained using the hinge loss in Example 3, since er^ramp(γ)_S[f_S] ≤ er^hinge(γ)_S[f_S] and the bound B here is also smaller.
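The claimed constants in Examples 1–4 are easy to check numerically. The sketch below (plain Python; the grid resolution and tolerance are arbitrary) verifies property (i) with the stated B, property (iii), and the pointwise ordering ℓ_ramp(γ) ≤ ℓ_hinge(γ) used above, all via the corresponding functions φ:

```python
gamma = 0.4  # same value as in Figure 2

# phi(u) for Examples 1-4, where u = y * yhat and Yhat = [-1, 1]:
phi = {
    "squared":  lambda u: (1 - u) ** 2,
    "absolute": lambda u: 1 - u,
    "hinge":    lambda u: max(0.0, gamma - u) / gamma,
    "ramp":     lambda u: 1.0 if u <= 0 else max(0.0, gamma - u) / gamma,
}
B = {"squared": 4.0, "absolute": 2.0, "hinge": (gamma + 1) / gamma, "ramp": 1.0}

for name, f in phi.items():
    for i in range(2001):
        u = -1 + i / 1000                        # grid over Yhat = [-1, 1]
        v = f(u)
        assert 0 <= v <= B[name] + 1e-9          # (i):   0 <= phi <= B
        assert v >= (1.0 if u <= 0 else 0.0)     # (iii): phi dominates the 0-1 loss
        assert phi["ramp"](u) <= phi["hinge"](u) + 1e-9  # ramp <= hinge pointwise
print("all properties verified")
```

Property (ii) (Lipschitzness) can be checked the same way by comparing φ at adjacent grid points.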
Figure 2: Loss functions used in Examples 1–4. Top left: squared loss ℓ_sq. Top right: absolute loss ℓ_abs. Bottom left: hinge loss ℓ_hinge(γ) for γ = 0.4. Bottom right: ramp loss ℓ_ramp(γ) for γ = 0.4. The zero-one loss ℓ_0-1 is shown in each case for comparison. All are margin-based loss functions and are plotted in terms of the corresponding functions φ (see Section 3).

The above examples illustrate how the generalization error er_D[f_S] can be bounded in terms of the covering numbers/fat-shattering dimension of F via generalization error bounds for f_S, using a variety of loss functions typically applied to real-valued learning problems. However, this approach has several limitations as well: (1) ε* is often not available in closed form; (2) bounds in terms of fat_F can often be quite loose; and (3) the bounds do not provide intuition about the problem; in particular, they do not guide algorithm design. Below we discuss a different approach to bounding the 0-1 generalization error er_D[f_S] of a real-valued classifier sign(f_S) which avoids these difficulties.

4 Bounding the Generalization Error of sign(f_S) via Margin Analysis

Consider again the example in Figure 1. Can we explain our intuition that h₁ should perform better than h₂? One possible explanation is that while h₂ classifies all training examples correctly, h₁ has a better margin on most of the examples. Formally, define the margin of a real-valued classifier h = sign(f) (or simply of f) on an example (x, y) as

    margin(f, (x, y)) = yf(x).

If the margin yf(x) is positive, then sign(f(x)) = y, i.e. f makes the correct prediction on (x, y); if the margin yf(x) is negative, then sign(f(x)) ≠ y, i.e. f makes the wrong prediction on (x, y). Moreover, a larger positive value of the margin yf(x) can be viewed as a more confident prediction. In estimating the generalization error of h_S = sign(f_S), we may then want to take into account not only the number of examples in S that are classified correctly by f_S, but also the number of examples that are classified correctly with a large margin.

To formalize this idea, let γ > 0, and define the γ-margin loss ℓ_margin(γ) : {−1,+1} × ℝ → {0,1} as

    ℓ_margin(γ)(y, ŷ) = 1(yŷ < γ).    (11)

(In the terminology of the previous section, this is clearly a margin-based loss.) The empirical γ-margin error of f_S then counts the fraction of training examples that have margin less than γ:

    er^margin(γ)_S[f_S] = (1/m) Σ_{i=1..m} 1(yᵢ f_S(xᵢ) < γ).

We cannot bound er_D[f_S] in terms of er^margin(γ)_S[f_S] using the technique of the previous section, since ℓ_margin(γ) is not Lipschitz. Instead, one can show directly the following one-sided uniform convergence result (strictly speaking, this is not uniform convergence, but rather a uniform bound):

Theorem 4.1 (Bartlett, 1998). Let F ⊆ ℝ^X. Let γ > 0 and let D be any distribution on X × {−1,+1}. Then for any ε > 0,

    P_{S∼D^m}( ∃f ∈ F : er_D[f] > er^margin(γ)_S[f] + ε ) ≤ 2 N_∞(γ/2, F, 2m) e^{−ε²m/8}.

The proof is similar in structure to the previous uniform convergence proofs we have seen, with the same four main steps. Again, the main difference is in Step 3 (reduction to a finite class). Details can be found in [2, 1]; you are also encouraged to try the proof yourself!

It is worth noting that, as with previous uniform convergence results we have seen, the proof actually yields a more refined bound in terms of the expected covering numbers E_{x∼µ^{2m}}[ N(γ/2, F|_x, d_∞) ] (where µ is the marginal of D on X) rather than the uniform covering numbers N_∞(γ/2, F, 2m) = max_{x∈X^{2m}} N(γ/2, F|_x, d_∞). However, the expected covering numbers are typically difficult to estimate, and so one generally falls back on the upper bound in terms of the uniform covering numbers.³ Note also that the above bound makes use of the d_∞ covering numbers of F, rather than the d₁ covering numbers that were used in the results of the last lecture.
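The empirical γ-margin error is straightforward to compute. A minimal sketch (Python; the one-dimensional sample and f(x) = x are toy choices): although f makes no classification mistakes here, two points are correct only by a small margin, and the γ-margin error picks this up.

```python
def empirical_margin_error(f, S, gamma):
    """Fraction of examples in S with margin y * f(x) below gamma."""
    return sum(1 for x, y in S if y * f(x) < gamma) / len(S)

# Toy sample: f(x) = x classifies all six points correctly, but two only barely.
S = [(-0.9, -1), (-0.5, -1), (-0.05, -1), (0.04, 1), (0.6, 1), (0.8, 1)]
f = lambda x: x
print(empirical_margin_error(f, S, gamma=1e-12))  # ~0-1 error: 0.0
print(empirical_margin_error(f, S, gamma=0.1))    # two small margins: 1/3
```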
A more important difference is that the covering number is measured at a scale γ/2 that is independent of ε; this means that when converting the above result into a confidence bound, one gets a closed-form expression. In particular, we have that with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^margin(γ)_S[f_S] + √( (8/m)( ln N_∞(γ/2, F, 2m) + ln(2/δ) ) ).    (12)

As before, the covering number N_∞(γ/2, F, 2m) needs to be sub-exponential in m for the above result to be meaningful. For function classes F with finite pseudo-dimension or finite fat-shattering dimension, it is possible to obtain polynomial bounds on the d_∞ covering numbers of F in terms of these quantities, similar to the bounds we discussed for the d₁ covering numbers in the previous lecture; see [1] for details. For certain function classes of interest, the d_∞ covering numbers can also be bounded directly. For example, we state below a bound on the covering numbers of classes of linear functions over bounded subsets of ℝⁿ with bounded-norm weight vectors; details can be found in [3]:⁴

Theorem 4.2 (Zhang, 2002). Let R, W > 0. Let X = {x ∈ ℝⁿ : ‖x‖₂ ≤ R} and F = {f : X → ℝ | f(x) = w·x for some w ∈ ℝⁿ, ‖w‖₂ ≤ W}. Then ∀γ > 0,

    ln N_∞(γ, F, m) ≤ (36 R² W² / γ²) ln( 2 ⌈4RW/γ + 2⌉ m + 1 ).

³Note that the expected covering numbers are sometimes referred to as random covering numbers in the literature; this is a misnomer, as there is no randomness left once one takes the expectation.

⁴Note that since d₁ ≤ d_∞, any upper bound on the d_∞ covering numbers of a function class also implies an upper bound on its d₁ covering numbers.
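Plugging Theorem 4.2 (at scale γ/2 and sample size 2m) into (12) gives a fully computable bound. The sketch below uses my reading of the theorem's constants as reconstructed above, so treat the exact numbers as an assumption; the values R = W = 1, m = 10⁶, δ = 0.05 are illustrative. It shows the bound shrinking as the margin scale γ grows:

```python
import math

def zhang_ln_cover(R, W, gamma, m):
    """ln N_inf(gamma, F, m) per Theorem 4.2 (constants as stated above)."""
    a = (R * W / gamma) ** 2
    return 36 * a * math.log(2 * math.ceil(4 * R * W / gamma + 2) * m + 1)

def margin_bound(margin_err, R, W, gamma, m, delta):
    """Right-hand side of (12), with ln N_inf(gamma/2, F, 2m) from Theorem 4.2."""
    ln_N = zhang_ln_cover(R, W, gamma / 2, 2 * m)
    return margin_err + math.sqrt(8 * (ln_N + math.log(2 / delta)) / m)

# Same empirical gamma-margin error (0.05), two different margin scales:
b_small = margin_bound(0.05, R=1, W=1, gamma=0.1, m=10**6, delta=0.05)
b_large = margin_bound(0.05, R=1, W=1, gamma=0.5, m=10**6, delta=0.05)
print(b_small, b_large)  # gamma = 0.1 gives a vacuous bound (> 1); gamma = 0.5 does not
```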
Thus for linear classifiers h_S = sign(f_S), where f_S is learned from the class F in the above theorem, we have the following generalization error bound: for any γ > 0 and 0 < δ ≤ 1, with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^margin(γ)_S[f_S] + c √( ( (R²W²/γ²) ln m + ln(1/δ) ) / m ).    (13)

Note that if we can keep the first term in the bound small for a larger value of γ (i.e. achieve a good separation on the data with a larger margin γ), then the complexity term on the right becomes smaller, implying a better generalization error bound. This suggests that algorithms achieving a large margin on the training sample may lead to good generalization performance.

However, it is important to note a caveat here: the above bound holds only when the margin parameter γ > 0 is fixed in advance, before seeing the data; it cannot be chosen after seeing the sample. In a later lecture, we will see how to extend this to a uniform margin bound that holds uniformly over all γ in some bounded range, and therefore applies even when γ is chosen based on the data.

In closing, we note that the margin idea can be extended to learning problems other than binary classification; see for example [3].

Exercise. Let X ⊆ ℝⁿ, and let S = ((x₁, y₁), ..., (x_m, y_m)) ∈ (X × {−1,+1})^m. Say you learn from S a linear classifier h_S = sign(w_S · x) using the SVM algorithm with some value of C:

    w_S = argmin_{w∈ℝⁿ} ( (1/2)‖w‖₂² + C Σ_{i=1..m} ℓ_hinge(yᵢ, w·xᵢ) ).

Can you bound ‖w_S‖₂? (Hint: consider what you get with w = 0.)

5 Next Lecture

In the next lecture, we will derive generalization error bounds for both classification and regression using Rademacher averages, which provide a distribution-dependent measure of the complexity of a function class, whose empirical (data-dependent) analogues can often be estimated reliably from data, and which can lead to tighter bounds than can be obtained with distribution-independent complexity measures such as the growth function/VC-dimension or covering numbers/fat-shattering dimension.

References

[1] Martin Anthony and Peter L. Bartlett.
Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[2] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

[3] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.