E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis

Lecturer: Shivani Agarwal          Scribe: Narasimhan R

1 Introduction

In the last few lectures we have seen how to obtain high-confidence bounds on the generalization error of functions learned from function classes of limited capacity, measured in terms of the growth function and VC-dimension for binary-valued function classes in the case of binary classification, and in terms of the covering numbers, pseudo-dimension, and fat-shattering dimension for real-valued function classes in the case of regression. Table 1 summarizes the nature of the results we have obtained.

Table 1: Summary of generalization error bounds we have seen so far.

- Binary classification. Function class: H ⊆ {−1,+1}^X. Loss: ℓ_{0-1}. Capacity measures: growth function Π_H(m), VCdim(H).
  Bound: w.p. ≥ 1 − δ (over S ∼ D^m),  er_D[h_S] ≤ er_S[h_S] + c √( (VCdim(H) ln m + ln(1/δ)) / m ).

- Regression. Function class: F ⊆ R^X. Loss: ℓ with 0 ≤ ℓ ≤ B, L-Lipschitz. Capacity measures: N_1(ε, F, m), Pdim(F), fat_F(ε).
  Bound: w.p. ≥ 1 − δ (over S ∼ D^m),  er^ℓ_D[f_S] ≤ er^ℓ_S[f_S] + ε*(fat_F, m, δ, B, L).

- Binary classification using real-valued functions. Function class: H = sign(F), F ⊆ R^X. Loss: ℓ_{0-1}. Capacity measures: ?  Bound: ?

(Here ε*(fat_F, m, δ, B, L) is the smallest value of ε for which the bound obtained in the last lecture is smaller than δ; it has no closed-form expression in general.)

In this lecture, we return to binary classification, but focus on function classes H of the form H = sign(F) for some real-valued function class F ⊆ R^X. If an algorithm learns a function h_S : X → {−1,+1} of the form h_S(x) = sign(f_S(x)) for some f_S ∈ F, can we say more about its generalization error than for a general binary-valued function h_S : X → {−1,+1}? Note that the SVM algorithm falls in this category, while the basic formulations of the decision tree and nearest neighbour classification algorithms do not.¹ Let us first see what we can say about the generalization error of h_S = sign(f_S) using the techniques we have learned so far.

2 Bounding Generalization Error of sign(f_S) via VCdim(sign(F))

One approach to bounding the generalization error of h_S ∈ sign(F) is to consider the VC-dimension of the binary-valued function class sign(F). If VCdim(sign(F)) is finite, then for any 0 < δ ≤ 1, with probability at least 1 − δ (over the draw of S ∼ D^m),

    er_D[h_S] ≤ er_S[h_S] + c √( (VCdim(sign(F)) ln m + ln(1/δ)) / m ),    (1)

where c > 0 is some fixed constant.

However this approach is limited. Consider for example the two classifiers shown in Figure 1. Both are selected from the class of linear classifiers in R²: H = sign(F), where F = {f : R² → R | f(x) = w·x + b for some w ∈ R², b ∈ R}. We saw earlier that the VC-dimension of this class is 3. The empirical error of h₁ on the training sample S is er_S[h₁] = 1/m; for h₂, it is er_S[h₂] = 0. So the above result would give a better bound on the generalization error of h₂ than on that of h₁, even though many of us would intuitively expect h₁ to perform better on new examples than h₂.

Figure 1: Two linear classifiers.

¹ There are variants of decision tree and nearest neighbour algorithms that take into account class probabilities or distance-based weights and could be described in the above form, but the basic formulations are not naturally described in this way.
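To see how bound (1) plays out numerically in the Figure 1 setting, here is a short Python sketch (an illustration added here, not part of the original notes); the constant c in (1) is unspecified, so c = 1 is assumed purely for illustration, as are the sample size and confidence level.

```python
import math

def vc_bound(emp_err, vcdim, m, delta, c=1.0):
    # Right-hand side of bound (1): empirical error plus the VC complexity term.
    return emp_err + c * math.sqrt((vcdim * math.log(m) + math.log(1.0 / delta)) / m)

# Figure 1 setting: linear classifiers in R^2 have VC-dimension 3.
m, delta = 1000, 0.05                        # hypothetical sample size and confidence level
print(vc_bound(1.0 / m, 3, m, delta))        # h_1: one training mistake
print(vc_bound(0.0,     3, m, delta))        # h_2: zero training mistakes -- smaller bound
```

As the output confirms, the VC-dimension-based bound is smaller for h₂, which is exactly the limitation discussed above: it cannot capture the intuition that h₁, with its larger margins on most examples, should generalize better.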

3 Bounding Generalization Error of sign(f_S) via Bounds for f_S

As noted above, bounding the generalization error of h_S = sign(f_S) using only the binary values output by sign(f_S) has its limitations. Therefore we would like to take into account the real-valued predictions made by f_S. To this end, and to simplify later statements, let us first extend the 0-1 loss to real-valued predictions as follows: define ℓ_{0-1} : {−1,+1} × R → {0,1} as

    ℓ_{0-1}(y, ŷ) = 1(sign(ŷ) ≠ y).    (2)

This gives for example er_D[f_S] = P_{(x,y)∼D}(sign(f_S(x)) ≠ y) (= er_D[h_S] for h_S = sign(f_S)). We can then consider obtaining bounds on the generalization error er_D[f_S] using results derived in the last lecture in terms of the covering numbers or fat-shattering dimension of the real-valued function class F. Unfortunately, we cannot use these results to bound er_D[f_S] directly in terms of er_S[f_S] using covering numbers of F, since ℓ_{0-1} is not Lipschitz.² However, for any bounded, Lipschitz loss function ℓ ≥ ℓ_{0-1}, we have er_D[f_S] ≤ er^ℓ_D[f_S], which in turn can be bounded in terms of er^ℓ_S[f_S] using covering numbers of F. Specifically, let F ⊆ Ŷ^X for some Ŷ ⊆ R, and consider any loss function ℓ : {−1,+1} × Ŷ → [0, ∞) for which there are constants B, L > 0 such that for all y ∈ {−1,+1} and all ŷ, ŷ₁, ŷ₂ ∈ Ŷ:

  (i)   0 ≤ ℓ(y, ŷ) ≤ B;
  (ii)  |ℓ(y, ŷ₁) − ℓ(y, ŷ₂)| ≤ L |ŷ₁ − ŷ₂|; and
  (iii) ℓ_{0-1}(y, ŷ) ≤ ℓ(y, ŷ).

Then with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^ℓ_D[f_S] ≤ er^ℓ_S[f_S] + ε*(fat_F, m, δ, B, L),    (3)

where ε*(fat_F, m, δ, B, L) is as described in Table 1.

Below we shall give several concrete examples of loss functions satisfying the above conditions. Before we do so, it will be useful to introduce an alternative description of loss functions ℓ : {−1,+1} × Ŷ → [0, ∞) that satisfy the following symmetry condition (assuming Ŷ ⊆ R is closed under negation):

    ℓ(−1, ŷ) = ℓ(1, −ŷ)   ∀ ŷ ∈ Ŷ.    (4)

² We could bound er_D[f_S] in terms of er_S[f_S] using covering numbers of the loss function class ℓ_{0-1} ∘ F, but this simply takes us back to the growth function/VC-dimension of sign(F).
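As a quick sanity check (an illustration added here, not part of the original notes), the following Python sketch numerically verifies conditions (i)-(iii) and the symmetry condition (4) on a grid for one candidate surrogate: the squared loss on Ŷ = [−1, 1] that appears as Example 1 below, with the constants B = 4, L = 4 claimed there. The choice sign(0) = −1 is an assumption, made only to get a concrete implementation of (2).

```python
import numpy as np

def loss_01(y, yhat):
    # Extended 0-1 loss (2); sign(0) is taken as -1 here (an assumption).
    return float((1.0 if yhat > 0 else -1.0) != y)

def loss_sq(y, yhat):
    # Squared loss on Yhat = [-1, 1]: (y - yhat)^2 = (1 - y*yhat)^2.
    return (y - yhat) ** 2

B, L = 4.0, 4.0
grid = np.linspace(-1.0, 1.0, 2001)
for y in (-1.0, 1.0):
    vals = np.array([loss_sq(y, u) for u in grid])
    assert np.all((0.0 <= vals) & (vals <= B))                                 # condition (i)
    assert np.all(np.abs(np.diff(vals)) <= L * np.abs(np.diff(grid)) + 1e-9)   # condition (ii), on the grid
    assert all(loss_sq(y, u) >= loss_01(y, u) for u in grid)                   # condition (iii)
assert all(abs(loss_sq(-1, u) - loss_sq(1, -u)) < 1e-12 for u in grid)         # symmetry (4)
print("conditions (i)-(iii) and (4) hold on the grid")
```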

For any loss function ℓ satisfying the symmetry condition (4), we can write

    ℓ(y, ŷ) = ℓ(1, yŷ)   ∀ y ∈ {−1,+1}, ŷ ∈ Ŷ,    (5)

and therefore the loss function can equivalently be described using a single-variable function φ : Ŷ → [0, ∞) given by φ(u) = ℓ(1, u):

    ℓ(y, ŷ) = φ(yŷ)   ∀ y ∈ {−1,+1}, ŷ ∈ Ŷ.    (6)

Such loss functions are often called margin-based loss functions, for reasons that will become clear later. As an example, the 0-1 loss above can be viewed as a margin-based loss with φ(u) = 1(u ≤ 0). On the other hand, an asymmetric misclassification loss ℓ : {−1,+1} × {−1,+1} → {0, a, b} with a ≠ b, ℓ(y, y) = 0, ℓ(−1, 1) = a, and ℓ(1, −1) = b, cannot be written as a margin-based loss. Margin-based losses are easily visualized through a plot of the function φ(·); they also satisfy the following properties, which can be easily verified:

  (M1) 0 ≤ ℓ ≤ B  ⟺  0 ≤ φ ≤ B
  (M2) ℓ is L-Lipschitz in its 2nd argument  ⟺  φ is L-Lipschitz
  (M3) ℓ ≥ ℓ_{0-1}  ⟺  φ(u) ≥ 1(u ≤ 0) ∀ u ∈ Ŷ
  (M4) ℓ is convex in its 2nd argument  ⟺  φ is convex.

We now give several examples of (margin-based) losses satisfying conditions (i)-(iii) above, which can be used to derive bounds on the 0-1 generalization error er_D[f_S] in terms of the covering numbers/fat-shattering dimension of F.

Example 1 (Squared loss). Let Ŷ = [−1, 1], and

    ℓ_sq(y, ŷ) = (y − ŷ)² = (1 − yŷ)²   ∀ y ∈ {−1,+1}, ŷ ∈ Ŷ.

Then clearly, ℓ_sq satisfies conditions (i)-(iii) above with B = 4 and L = 4 (over Ŷ as above; see Figure 2). Therefore we can bound the generalization error er_D[f_S] as follows: with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^sq_S[f_S] + ε*(fat_F, m, δ, 4, 4).    (7)

Example 2 (Absolute loss). Let Ŷ = [−1, 1], and

    ℓ_abs(y, ŷ) = |y − ŷ| = 1 − yŷ   ∀ y ∈ {−1,+1}, ŷ ∈ Ŷ.

Then clearly, ℓ_abs satisfies properties (i)-(iii) above with B = 2 and L = 1 (see Figure 2). Therefore we have with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^abs_S[f_S] + ε*(fat_F, m, δ, 2, 1).    (8)

Example 3 (Hinge loss). Let Ŷ = [−1, 1], 0 < γ ≤ 1, and

    ℓ_hinge(γ)(y, ŷ) = (1/γ)(γ − yŷ)₊   ∀ y ∈ {−1,+1}, ŷ ∈ Ŷ.

(Recall z₊ = max(0, z).) Clearly, ℓ_hinge(γ) satisfies properties (i)-(iii) above with B = (γ + 1)/γ and L = 1/γ (see Figure 2). Therefore we have with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^hinge(γ)_S[f_S] + ε*(fat_F, m, δ, (γ + 1)/γ, 1/γ).    (9)

Example 4 (Ramp loss). Let Ŷ = [−1, 1], 0 < γ ≤ 1, and

    ℓ_ramp(γ)(y, ŷ) = 1 if yŷ ≤ 0,  and  ℓ_ramp(γ)(y, ŷ) = ℓ_hinge(γ)(y, ŷ) if yŷ > 0,   ∀ y ∈ {−1,+1}, ŷ ∈ Ŷ.

Clearly, ℓ_ramp(γ) satisfies properties (i)-(iii) above with B = 1 and L = 1/γ (see Figure 2). Therefore we have with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^ramp(γ)_S[f_S] + ε*(fat_F, m, δ, 1, 1/γ).    (10)

Note that this bound is uniformly better than the one obtained using the hinge loss in Example 3, since er^ramp(γ)_S[f_S] ≤ er^hinge(γ)_S[f_S] and the bound B here is also smaller.
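As a further illustration (again added here, not part of the notes), the sketch below encodes Examples 1-4 through their φ functions and computes the corresponding empirical surrogate errors on a small hypothetical sample of margins y_i f(x_i); the margin values are made up purely for illustration. It also checks the pointwise orderings used above: each surrogate dominates the 0-1 loss, and the ramp loss lies below the hinge loss.

```python
import numpy as np

gamma = 0.4  # margin parameter used in Figure 2; 0 < gamma <= 1

# Margin-based losses from Examples 1-4, written as functions phi(u) of the margin u = y*yhat.
phi_01    = lambda u: np.where(u <= 0, 1.0, 0.0)
phi_sq    = lambda u: (1.0 - u) ** 2
phi_abs   = lambda u: 1.0 - u                        # on Yhat = [-1, 1], |y - yhat| = 1 - y*yhat
phi_hinge = lambda u: np.maximum(gamma - u, 0.0) / gamma
phi_ramp  = lambda u: np.where(u <= 0, 1.0, phi_hinge(u))

# Hypothetical margins y_i * f_S(x_i) on a toy sample (purely illustrative numbers).
margins = np.array([0.9, 0.7, 0.5, 0.3, 0.1, -0.05, -0.4])

for name, phi in [("0-1", phi_01), ("sq", phi_sq), ("abs", phi_abs),
                  ("hinge", phi_hinge), ("ramp", phi_ramp)]:
    print(f"{name:5s} empirical error: {phi(margins).mean():.3f}")

# Condition (iii) and the ramp <= hinge comparison behind the remark after Example 4.
assert np.all(phi_ramp(margins) >= phi_01(margins))
assert np.all(phi_ramp(margins) <= phi_hinge(margins))
```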

Figure 2: Loss functions used in Examples 1-4. Top left: Squared loss ℓ_sq. Top right: Absolute loss ℓ_abs. Bottom left: Hinge loss ℓ_hinge(γ) for γ = 0.4. Bottom right: Ramp loss ℓ_ramp(γ) for γ = 0.4. The zero-one loss ℓ_{0-1} is shown in each case for comparison. All are margin-based loss functions and are plotted in terms of the corresponding functions φ (see Section 3).

The above examples illustrate how the generalization error er_D[f_S] can be bounded in terms of the covering numbers/fat-shattering dimension of F via generalization error bounds for f_S using a variety of loss functions typically applied to real-valued learning problems. However, there are several limitations of this approach as well: (1) ε* is often not available in closed form; (2) bounds in terms of fat_F can often be quite loose; and (3) the bounds do not provide intuition about the problem; in particular, they do not guide algorithm design. Below we discuss a different approach for bounding the 0-1 generalization error er_D[f_S] of a real-valued classifier sign(f_S) which avoids these difficulties.

4 Bounding Generalization Error of sign(f_S) via Margin Analysis

Consider again the example in Figure 1. Can we explain our intuition that h₁ should perform better than h₂? One possible explanation is that while h₂ classifies all training examples correctly, h₁ has a better margin on most of the examples. Formally, define the margin of a real-valued classifier h = sign(f) (or simply of f) on an example (x, y) as

    margin(f, (x, y)) = y f(x).

If the margin y f(x) is positive, then sign(f(x)) = y, i.e. f makes the correct prediction on (x, y); if the margin y f(x) is negative, then sign(f(x)) ≠ y, i.e. f makes the wrong prediction on (x, y). Moreover, a larger positive value of the margin y f(x) can be viewed as a more confident prediction.
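To make the margin definition concrete, here is a small sketch (illustrative only; the weight vector and data points are made up) that computes margin(f, (x, y)) = y f(x) for a linear classifier f(x) = w·x + b, of the kind used in Figure 1, on a few labelled points.

```python
import numpy as np

def margin(w, b, x, y):
    # margin(f, (x, y)) = y * f(x) for the linear function f(x) = w.x + b
    return y * (np.dot(w, x) + b)

w, b = np.array([2.0, -1.0]), 0.5            # hypothetical linear classifier in R^2
sample = [(np.array([1.0, 0.0]),  1),        # correctly classified, large margin (confident)
          (np.array([0.1, 0.3]),  1),        # correctly classified, small margin
          (np.array([0.3, 0.0]), -1)]        # misclassified: negative margin

for x, y in sample:
    m = margin(w, b, x, y)
    print(f"y*f(x) = {m:+.2f} ->", "correct prediction" if m > 0 else "wrong prediction")
```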

In estimating the generalization error of h_S = sign(f_S), we may then want to take into account not only the number of examples in S that are classified correctly by f_S, but the number of examples that are classified correctly with a large margin. To formalize this idea, let γ > 0, and define the γ-margin loss ℓ_margin(γ) : {−1,+1} × R → {0, 1} as

    ℓ_margin(γ)(y, ŷ) = 1(yŷ ≤ γ).    (11)

(In the terminology of the previous section, this is clearly a margin-based loss.) The empirical γ-margin error of f_S then counts the fraction of training examples that have margin at most γ:

    er^margin(γ)_S[f_S] = (1/m) Σ_{i=1}^{m} 1(y_i f_S(x_i) ≤ γ).

We cannot bound er_D[f_S] in terms of er^margin(γ)_S[f_S] using the technique of the previous section, since ℓ_margin(γ) is not Lipschitz. Instead, one can show directly the following one-sided uniform convergence result (strictly speaking, this is not uniform convergence, but rather a uniform bound):

Theorem 4.1 (Bartlett, 1998). Let F ⊆ R^X. Let γ > 0 and let D be any distribution on X × {−1,+1}. Then for any ε > 0,

    P_{S∼D^m}( ∃ f ∈ F : er_D[f] ≥ er^margin(γ)_S[f] + ε ) ≤ 2 N_∞(γ/2, F, 2m) e^{−ε²m/8}.

The proof is similar in structure to the previous uniform convergence proofs we have seen, with the same four main steps. Again, the main difference is in Step 3 (reduction to a finite class). Details can be found in [2, 1]; you are also encouraged to try the proof yourself!

It is worth noting that, as in the case of previous uniform convergence results we have seen, the proof actually yields a more refined bound in terms of the expected covering numbers E_{x∼μ^{2m}}[ N(γ/2, F|_x, d_∞) ] (where μ is the marginal of D on X) rather than the uniform covering numbers N_∞(γ/2, F, 2m) = max_{x∈X^{2m}} N(γ/2, F|_x, d_∞). However, the expected covering numbers are typically difficult to estimate, and so one generally falls back on the upper bound in terms of uniform covering numbers.³

The above bound makes use of the d_∞ covering numbers of F rather than the d_1 covering numbers that were used in the results of the last lecture. A more important difference is that the covering number is measured at a scale of γ/2 and is independent of ε; this means that when converting to a confidence bound, one gets a closed-form expression. In particular, we have that with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^margin(γ)_S[f_S] + √( 8 ( ln N_∞(γ/2, F, 2m) + ln(2/δ) ) / m ).    (12)

As before, the covering number N_∞(γ/2, F, 2m) needs to be sub-exponential in m for the above result to be meaningful. For function classes F with finite pseudo-dimension or finite fat-shattering dimension, it is possible to obtain polynomial bounds on the d_∞ covering numbers of F in terms of these quantities, similar to the bounds we discussed for the d_1 covering numbers in the previous lecture; see [1] for details. For certain function classes of interest, the d_∞ covering numbers can also be bounded directly. For example, we state below a bound on the covering numbers of classes of linear functions over bounded subsets of R^n with bounded-norm weight vectors; details can be found in [3]:⁴

Theorem 4.2 (Zhang, 2002). Let R, W > 0. Let X = {x ∈ R^n : ‖x‖₂ ≤ R} and F = {f : X → R | f(x) = w·x for some w ∈ R^n, ‖w‖₂ ≤ W}. Then for all γ > 0,

    ln N_∞(γ, F, m) ≤ (36 R² W² / γ²) ln( 2 ⌈4RW/γ + 2⌉ m + 1 ).

³ Note that the expected covering numbers are sometimes referred to as random covering numbers in the literature; this is a misnomer, as there is no randomness left once you take the expectation.

⁴ Note that since d_1 ≤ d_∞, any upper bound on the d_∞ covering numbers of a function class also implies an upper bound on its d_1 covering numbers.
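Combining bound (12) with the covering number bound of Theorem 4.2 gives an explicit, closed-form margin bound for bounded-norm linear classes; the sketch below carries out exactly that computation (an added illustration, with made-up values of R, W, γ, m, δ and of the empirical margin error), and is essentially the calculation behind the bound stated next.

```python
import math

def zhang_log_cov(gamma, R, W, m):
    # ln N_infty(gamma, F, m) bound from Theorem 4.2 (Zhang, 2002) for
    # F = {x -> w.x : ||w||_2 <= W} over {x : ||x||_2 <= R}.
    return (36.0 * R**2 * W**2 / gamma**2) * math.log(2 * math.ceil(4 * R * W / gamma + 2) * m + 1)

def margin_bound(emp_margin_err, gamma, R, W, m, delta):
    # Right-hand side of bound (12), with the covering number at scale gamma/2
    # on 2m points bounded via Theorem 4.2.
    log_cov = zhang_log_cov(gamma / 2.0, R, W, 2 * m)
    return emp_margin_err + math.sqrt(8.0 * (log_cov + math.log(2.0 / delta)) / m)

# Hypothetical numbers, purely to show how the pieces fit together.
print(margin_bound(emp_margin_err=0.05, gamma=0.5, R=1.0, W=1.0, m=1_000_000, delta=0.05))
```

Note that the covering number is evaluated at scale γ/2 on 2m points, as required by (12); even with these fairly generous hypothetical constants, a large sample size is needed before the complexity term becomes small.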

Thus for linear classifiers h_S = sign(f_S), where f_S is learned from the class F in the above theorem, we have the following generalization error bound: for any γ > 0 and 0 < δ ≤ 1, with probability at least 1 − δ (over S ∼ D^m),

    er_D[f_S] ≤ er^margin(γ)_S[f_S] + c √( ( (R²W²/γ²) ln m + ln(1/δ) ) / m ).    (13)

Note that if we can make the first term in the bound small for a larger value of γ (i.e. achieve a good separation on the data with a larger margin γ), then the complexity term on the right becomes smaller, implying a better generalization error bound. This suggests that algorithms achieving a large margin on the training sample may lead to good generalization performance. However, it is important to note a caveat here: the above bound holds only when the margin parameter γ > 0 is fixed in advance, before seeing the data; it cannot be chosen after seeing the sample S. In a later lecture, we will see how to extend this to a uniform margin bound that holds uniformly over all γ in some bounded range, and will therefore apply even when γ is chosen based on the data.

In closing, we note that the margin idea can be extended to learning problems other than binary classification; see for example [3].

Exercise. Let X ⊆ R^n, and let S = ((x₁, y₁), ..., (x_m, y_m)) ∈ (X × {−1,+1})^m. Say you learn from S a linear classifier h_S(x) = sign(w_S · x) using the SVM algorithm with some value of C:

    w_S = argmin_{w ∈ R^n} ( (1/2) ‖w‖₂² + C Σ_{i=1}^{m} ℓ_hinge(y_i, w·x_i) ).

Can you bound ‖w_S‖₂? (Hint: consider what you get with w = 0.)

5 Next Lecture

In the next lecture, we will derive generalization error bounds for both classification and regression using Rademacher averages, which provide a distribution-dependent measure of the complexity of a function class whose empirical (data-dependent) analogues can often be estimated reliably from data, and which can lead to tighter bounds than what can be obtained with distribution-independent complexity measures such as the growth function/VC-dimension or covering numbers/fat-shattering dimension.

References

[1] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[2] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.

[3] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.