E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)


Covering Numbers, Pseudo-Dimension, and Fat-Shattering Dimension
Lecturer: Shivani Agarwal    Scribe: Shivani Agarwal

1 Introduction

So far we have seen how to obtain high-confidence bounds on the generalization error er^{0-1}_D[h_S] of a binary classifier h_S learned by an algorithm from a function class H ⊆ {−1, +1}^X of limited capacity, using the ideas of uniform convergence. We saw the use of the growth function Π_H(m) to measure the capacity of the class H, as well as the VC-dimension VCdim(H), which provides a one-number summary of the capacity of H.

In this lecture we will consider the problem of regression, or learning of real-valued functions. Here we are given a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m for some Y ⊆ ℝ, assumed now to be drawn from D^m for some distribution D on X × Y, and the goal is to learn from this sample a real-valued function f_S : X → ℝ that has low generalization error er^l_D[f_S] w.r.t. some appropriate loss function l : Y × ℝ → [0, ∞), such as the squared loss l_sq given by l_sq(y, ŷ) = (ŷ − y)². As before, we will be interested in obtaining high-confidence bounds on the generalization error of the learned function, er^l_D[f_S]. Again, we will consider learning algorithms that learn f_S from a function class F ⊆ ℝ^X of limited capacity, and obtain generalization error bounds for such algorithms via a uniform convergence result that upper bounds the probability

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε ),

where er^l_D[f] = E_{(x,y)∼D}[ l(y, f(x)) ] denotes the generalization error of f and er^l_S[f] = (1/m) Σ_{i=1}^m l(y_i, f(x_i)) its empirical error on S. However, in order to derive such a result, we will need a different notion of capacity for a class of real-valued functions F. In particular, we will use the covering numbers of F, which will play a role analogous to that played by the growth function in the case of binary-valued function classes.

2 Covering Numbers

We start by considering covering numbers of subsets of a general metric space. We will then specialize this to subsets of Euclidean space, and use this to define covering numbers for a real-valued function class.

2.1 Covering Numbers in a General Metric Space

Let (A, d) be a metric space.¹ Let W ⊆ A and let ε > 0. A set C ⊆ W is said to be a proper ε-cover of W w.r.t. d if for every w ∈ W, ∃ c ∈ C such that d(w, c) < ε.² In other words, C ⊆ W is an ε-cover of W w.r.t. d if the union of open d-balls of radius ε centered at points in C contains W:³

    W ⊆ ∪_{c∈C} B_d(ε, c).    (1)

If W has a finite ε-cover w.r.t. d, then we define the ε-covering number of W w.r.t. d to be the cardinality of the smallest ε-cover of W:

    N(ε, W, d) = min{ |C| : C is an ε-cover of W w.r.t. d }.    (2)

¹ Recall that a metric space (A, d) consists of a set A together with a metric d : A × A → [0, ∞) that satisfies the following for all x, y, z ∈ A: (1) d(x, y) = 0 ⟺ x = y; (2) d(x, y) = d(y, x); and (3) d(x, z) ≤ d(x, y) + d(y, z).
² Sometimes it is also convenient to consider improper covers C ⊆ A which need not be contained in W.
³ Recall that the open d-ball of radius ε centered at x ∈ A is defined as B_d(ε, x) = { y ∈ A : d(x, y) < ε }.
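As a concrete illustration of these definitions (an addition to these notes, not part of the original), the following short Python sketch greedily builds a proper ε-cover of a finite point set W ⊂ ℝ² under the Euclidean metric; the size of the cover it returns is an upper bound on N(ε, W, d). The point set and the radii are arbitrary choices made for the example.

```python
import numpy as np

def greedy_cover(W, eps, dist):
    """Greedily build a proper eps-cover C of the finite set W.

    Every w in W ends up within distance < eps of some c in C (possibly itself),
    so len(C) is an upper bound on the covering number N(eps, W, dist).
    """
    C = []
    for w in W:
        if not any(dist(w, c) < eps for c in C):
            C.append(w)
    return C

euclidean = lambda u, v: np.linalg.norm(u - v)

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(500, 2))   # 500 points in [-1, 1]^2
for eps in [0.5, 0.25, 0.1]:
    C = greedy_cover(W, eps, euclidean)
    print(f"eps = {eps:4}:  |C| = {len(C)}  (upper bound on N(eps, W, d))")
```

Since the greedy centers are pairwise at distance at least ε, the same set also shows that any (ε/2)-cover must contain at least |C| points, so N(ε/2, W, d) ≥ |C|; greedy covering thus brackets the covering number at neighbouring scales.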

If W does not have a finite ε-cover w.r.t. d, we take N(ε, W, d) = ∞. Thus the covering numbers of W can be viewed as measuring the extent of W in (A, d) at granularity or scale ε.

2.2 Covering Numbers in Euclidean Space

Consider now A = ℝⁿ. We can define a number of different metrics on ℝⁿ, including in particular the following:

    d_1(x, x′) = (1/n) Σ_{i=1}^n |x_i − x′_i|    (3)
    d_2(x, x′) = ( (1/n) Σ_{i=1}^n (x_i − x′_i)² )^{1/2}    (4)
    d_∞(x, x′) = max_{i} |x_i − x′_i| .    (5)

Accordingly, for any W ⊆ ℝⁿ, we can define the corresponding covering numbers N(ε, W, d_p) for p = 1, 2, ∞. It is easy to see that d_1(x, x′) ≤ d_2(x, x′) (by Jensen's inequality) and that d_2(x, x′) ≤ d_∞(x, x′), from which it follows that the corresponding covering numbers satisfy the relation

    N(ε, W, d_1) ≤ N(ε, W, d_2) ≤ N(ε, W, d_∞).    (6)

2.3 Uniform Covering Numbers for a Real-Valued Function Class

Now let F be a class of real-valued functions on X, and let x = (x_1, …, x_m) ∈ X^m. Then F|_x = { (f(x_1), …, f(x_m)) : f ∈ F } ⊆ ℝ^m. For any ε > 0 and m ∈ ℕ, the uniform d_p covering numbers of F (for p = 1, 2, ∞) are defined as

    N_p(ε, F, m) = max_{x ∈ X^m} N(ε, F|_x, d_p)    (7)

if N(ε, F|_x, d_p) is finite for all x ∈ X^m, and N_p(ε, F, m) = ∞ otherwise. This should be compared with the definition of the growth function for a class of binary-valued functions H, which also involved a maximum over x ∈ X^m: in that case, H|_x was finite, and the maximum was over the cardinality of H|_x; here, F|_x may in general be infinite, and the maximum is over the extent of F|_x in ℝ^m at scale ε, as measured using the metric d_p. In particular, for H ⊆ {−1, +1}^X, we have that for any ε ≤ 2, N(ε, H|_x, d_∞) = |H|_x|, and therefore N_∞(ε, H, m) = Π_H(m). Thus the uniform covering numbers can be viewed as generalizing the notion of growth function to classes of real-valued functions.

Note that the term 'uniform' here refers to the maximum over all x ∈ X^m of the covering numbers of F|_x in ℝ^m, and is unrelated to the use of the term 'uniform' in uniform convergence, which refers to the supremum over functions f ∈ F. Unless otherwise stated, in what follows we will refer to the uniform covering numbers of a function class F simply as the covering numbers of F.

3 Uniform Convergence in a Real-Valued Function Class F

We can assume that functions in F take values in some set Ŷ ⊆ ℝ, so that F ⊆ Ŷ^X (Y can be thought of as the space of true labels, and Ŷ as the space of predicted labels). We will require the loss function l to be bounded, i.e. we will assume ∃ B > 0 such that 0 ≤ l(y, ŷ) ≤ B ∀ y ∈ Y, ŷ ∈ Ŷ. We will find it useful to define, for any function class F ⊆ Ŷ^X and loss l : Y × Ŷ → [0, B], the loss function class l_F ⊆ [0, B]^{X×Y} given by

    l_F = { l_f : X × Y → [0, B]  |  l_f(x, y) = l(y, f(x)) for some f ∈ F }.    (8)

We will first prove a uniform convergence result for general losses l as above in terms of the d_1 covering numbers of the loss function class l_F, and will then show that for many losses l, including the squared loss when Y and Ŷ are bounded, the d_1 covering numbers of l_F can further be bounded in terms of the d_1 covering numbers of F.
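Before turning to the uniform convergence result, here is a small numerical illustration (an addition to these notes, not part of the original) of the quantities defined in Section 2: it restricts a simple one-parameter class of bounded functions to a sample x and greedily estimates N(ε, F|_x, d_p) for p = 1, 2, ∞. The class f_w(x) = clip(wx, −1, 1), the parameter grid, and the sample are arbitrary choices; the greedy sizes are only upper bounds on the true covering numbers, which satisfy the ordering (6), and the uniform covering number (7) would further take a maximum over all samples x.

```python
import numpy as np

# The normalized metrics of (3)-(5) on R^m.
def d1(u, v):   return np.mean(np.abs(u - v))
def d2(u, v):   return np.sqrt(np.mean((u - v) ** 2))
def dinf(u, v): return np.max(np.abs(u - v))

def greedy_cover_size(vectors, eps, dist):
    """Size of a greedily built proper eps-cover: an upper bound on N(eps, ., dist)."""
    centers = []
    for v in vectors:
        if not any(dist(v, c) < eps for c in centers):
            centers.append(v)
    return len(centers)

rng = np.random.default_rng(1)
m = 50
x = rng.uniform(-1.0, 1.0, size=m)                 # sample x = (x_1, ..., x_m)
ws = np.linspace(-5.0, 5.0, 401)                   # parameter grid for the class
F_x = [np.clip(w * x, -1.0, 1.0) for w in ws]      # restriction F|_x, as vectors in R^m

eps = 0.2
for name, dist in [("d_1", d1), ("d_2", d2), ("d_inf", dinf)]:
    print(f"greedy cover size under {name}: {greedy_cover_size(F_x, eps, dist)}")
```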

Theorem 3.1. Let Y, Ŷ ⊆ ℝ. Let F ⊆ Ŷ^X, and let l : Y × Ŷ → [0, B]. Let D be any distribution on X × Y. Then for any ε > 0,

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε )  ≤  2 N_1(ε/8, l_F, 2m) e^{−mε²/(32B²)}.    (9)

Proof. The proof uses similar techniques as in the proof of uniform convergence for the 0-1 loss in the binary case that we saw in Lecture 3, and has the same broad steps. The key difference is in the reduction to a finite class (Step 3).

Step 1: Symmetrization. Following the same steps as in Lecture 3, we can show that for mε² ≥ 8B²,

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε )  ≤  2 P_{(S,S̃)∼D^m×D^m}( sup_{f∈F} ( er^l_{S̃}[f] − er^l_S[f] ) ≥ ε/2 ).    (10)

Step 2: Swapping permutations. Again using the same argument as in Lecture 3, we can show that

    P_{(S,S̃)∼D^m×D^m}( sup_{f∈F} ( er^l_{S̃}[f] − er^l_S[f] ) ≥ ε/2 )  ≤  sup_{(S,S̃)∈(X×Y)^{2m}} [ P_{σ∼Γ_{2m}}( sup_{f∈F} ( er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f] ) ≥ ε/2 ) ],    (11)

where, as in Lecture 3, Γ_{2m} denotes the set of permutations of [2m] that for each i ∈ [m] either swap i and m+i or leave both fixed, and σ is drawn uniformly from Γ_{2m}.

Step 3: Reduction to a finite class. Fix any (S, S̃) ∈ (X × Y)^m × (X × Y)^m, and for simplicity define (x_{m+i}, y_{m+i}) = (x̃_i, ỹ_i) ∀ i ∈ [m], so that er^l_{σ(S)}[f] = (1/m) Σ_{i=1}^m l_f(x_{σ(i)}, y_{σ(i)}) and er^l_{σ(S̃)}[f] = (1/m) Σ_{i=m+1}^{2m} l_f(x_{σ(i)}, y_{σ(i)}). Now consider l_F|_{(S,S̃)} ⊆ [0, B]^{2m}. Let G ⊆ F be such that l_G|_{(S,S̃)} is an (ε/8)-cover of l_F|_{(S,S̃)} w.r.t. d_1. Clearly, we can take |G| = N(ε/8, l_F|_{(S,S̃)}, d_1) ≤ N_1(ε/8, l_F, 2m); since l_F maps into a bounded interval, this is a finite number. Now consider any σ ∈ Γ_{2m}. We claim that if ∃ f ∈ F such that

    er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f] ≥ ε/2,    (12)

then ∃ g ∈ G such that

    er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ≥ ε/4.    (13)

To see this, let f ∈ F be such that (12) holds. Take any g ∈ G for which

    (1/(2m)) Σ_{i=1}^{2m} | l_f(x_i, y_i) − l_g(x_i, y_i) | < ε/8.    (14)

Such a g exists since l_G|_{(S,S̃)} is an (ε/8)-cover of l_F|_{(S,S̃)} w.r.t. d_1. We will show that g satisfies (13). In particular, we have

    er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f]
      = ( er^l_{σ(S̃)}[f] − er^l_{σ(S̃)}[g] ) + ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + ( er^l_{σ(S)}[g] − er^l_{σ(S)}[f] )    (15)
      ≤ | er^l_{σ(S̃)}[f] − er^l_{σ(S̃)}[g] | + ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + | er^l_{σ(S)}[f] − er^l_{σ(S)}[g] |    (16)
      ≤ ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + (1/m) Σ_{i=m+1}^{2m} | l_f(x_{σ(i)}, y_{σ(i)}) − l_g(x_{σ(i)}, y_{σ(i)}) | + (1/m) Σ_{i=1}^{m} | l_f(x_{σ(i)}, y_{σ(i)}) − l_g(x_{σ(i)}, y_{σ(i)}) |    (17)
      = ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + (1/m) Σ_{i=1}^{2m} | l_f(x_{σ(i)}, y_{σ(i)}) − l_g(x_{σ(i)}, y_{σ(i)}) |    (18)
      = ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + (1/m) Σ_{i=1}^{2m} | l_f(x_i, y_i) − l_g(x_i, y_i) |    (19)
      < ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + ε/4    (20)

by (14): since σ is a permutation of [2m], the two sums in (17) combine and reindex to give (19), and by (14) the sum in (19) is less than (1/m)(2m)(ε/8) = ε/4. Since the left-hand side is at least ε/2 by (12), it follows that er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] > ε/2 − ε/4 = ε/4.

The claim follows. Thus we have

    P_{σ∼Γ_{2m}}( sup_{f∈F} ( er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f] ) ≥ ε/2 )
      ≤ P_{σ∼Γ_{2m}}( max_{g∈G} ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) ≥ ε/4 )    (21)
      ≤ N_1(ε/8, l_F, 2m) · max_{g∈G} P_{σ∼Γ_{2m}}( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ≥ ε/4 )    (22)

by the union bound.

Step 4: Hoeffding's inequality. As in Lecture 3, Hoeffding's inequality can now be used to show that for any g ∈ G,

    P_{σ∼Γ_{2m}}( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ≥ ε/4 )    (23)
      = P_r( (1/m) Σ_{i=1}^m r_i ( l_g(x̃_i, ỹ_i) − l_g(x_i, y_i) ) ≥ ε/4 )    (24)
      ≤ e^{−mε²/(32B²)},    (25)

where r = (r_1, …, r_m) are independent random variables, each uniform over {−1, +1}. Putting everything together yields the desired result for mε² ≥ 8B²; for mε² < 8B², the result holds trivially. ∎

The above result yields a high-confidence bound on the generalization error of a function learned from F in terms of covering numbers of l_F. For well-behaved loss functions l, these can be further bounded in terms of covering numbers of F:

Lemma 3.2. Let Y, Ŷ ⊆ ℝ. Let F ⊆ Ŷ^X, and let l : Y × Ŷ → [0, B]. If l is Lipschitz in its second argument with Lipschitz constant L > 0, i.e.

    | l(y, ŷ) − l(y, ŷ′) | ≤ L | ŷ − ŷ′ |    ∀ y ∈ Y, ŷ, ŷ′ ∈ Ŷ,

then for any ε > 0 and m ∈ ℕ,

    N_1(ε, l_F, m) ≤ N_1(ε/L, F, m).

Proof. Let S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m, and let f, g ∈ F. Then

    (1/m) Σ_{i=1}^m | l_f(x_i, y_i) − l_g(x_i, y_i) | = (1/m) Σ_{i=1}^m | l(y_i, f(x_i)) − l(y_i, g(x_i)) |    (26)
      ≤ L · (1/m) Σ_{i=1}^m | f(x_i) − g(x_i) | .    (27)

Thus any (ε/L)-cover of F|_x w.r.t. d_1 yields, after applying the loss, an ε-cover of l_F|_S w.r.t. d_1, which implies the result. ∎

This yields the following corollary:

Corollary 3.3. Let Y, Ŷ ⊆ ℝ, F ⊆ Ŷ^X, and l : Y × Ŷ → [0, B] such that l is Lipschitz in its second argument with Lipschitz constant L > 0. Let D be any distribution on X × Y. Then for any ε > 0:

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε )  ≤  2 N_1(ε/(8L), F, 2m) e^{−mε²/(32B²)}.    (28)
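To see how a bound of the form (28) is used numerically, here is a short Python sketch (an illustrative addition, not from the original notes) that evaluates the logarithm of the right-hand side of (28) and doubles m until the bound drops below a target failure probability δ. The covering-number bound plugged in, ln N_1(α, F, k) ≤ d ln(Ck/α), is a hypothetical stand-in of the polynomial form that Section 4 justifies for classes of finite pseudo-dimension; d, C, and the other numbers are assumptions made for the example.

```python
import math

def log_bound(m, eps, B, L, log_N1):
    """Natural log of the right-hand side of (28):
    2 * N_1(eps/(8L), F, 2m) * exp(-m * eps^2 / (32 * B^2)).
    `log_N1(alpha, k)` should return an upper bound on ln N_1(alpha, F, k)."""
    return math.log(2.0) + log_N1(eps / (8 * L), 2 * m) - m * eps ** 2 / (32 * B ** 2)

# Hypothetical covering-number bound ln N_1(alpha, F, k) <= d * ln(C * k / alpha),
# of the polynomial form discussed in Section 4; d and C are made-up values.
d, C = 10, 2.0
log_N1 = lambda alpha, k: d * math.log(C * k / alpha)

B, L = 4.0, 4.0            # squared loss on [-1, 1]: bounded by 4, Lipschitz constant 4
eps, delta = 0.1, 0.01

m = 1
while log_bound(m, eps, B, L, log_N1) > math.log(delta):
    m *= 2
print("m =", m, "samples suffice for the bound to reach delta =", delta)
```

Because the exponential term eventually dominates the polynomial covering-number term, the doubling loop always terminates; the printed m is a sufficient (not necessary) sample size under the assumed covering-number growth.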

As an example, consider the squared loss l_sq(y, ŷ) = (ŷ − y)². It can be shown that if Y, Ŷ are bounded, then l_sq is bounded and Lipschitz. In particular, for Y = Ŷ = [−1, 1], we have 0 ≤ l_sq(y, ŷ) ≤ 4 ∀ y ∈ Y, ŷ ∈ Ŷ, and l_sq is Lipschitz in its second argument with Lipschitz constant L = 4:

    | l_sq(y, ŷ) − l_sq(y, ŷ′) | = | (y − ŷ)² − (y − ŷ′)² |    (29)
      = | ŷ² − ŷ′² − 2y(ŷ − ŷ′) |    (30)
      ≤ | ŷ + ŷ′ | | ŷ − ŷ′ | + 2 | y | | ŷ − ŷ′ |    (31)
      ≤ 4 | ŷ − ŷ′ |    (32)

since y, ŷ, ŷ′ ∈ [−1, 1]. Thus when both labels and predictions are in [−1, 1], one gets the following for the squared loss:

Corollary 3.4. Let Y = Ŷ = [−1, 1], F ⊆ Ŷ^X, and l_sq : Y × Ŷ → [0, 4] be given by l_sq(y, ŷ) = (ŷ − y)². Let D be any distribution on X × Y. Then for any ε > 0:

    P_{S∼D^m}( sup_{f∈F} ( er^sq_D[f] − er^sq_S[f] ) ≥ ε )  ≤  2 N_1(ε/32, F, 2m) e^{−mε²/512}.    (33)

Exercise. Show that for Y = Ŷ = [−1, 1], the absolute loss given by l_abs(y, ŷ) = |ŷ − y| ∀ y ∈ Y, ŷ ∈ Ŷ is bounded and is Lipschitz in its second argument with Lipschitz constant L = 1.

4 Pseudo-Dimension and Fat-Shattering Dimension

Just as the growth function Π_H(m) needed to be sub-exponential in m for the uniform convergence result in the binary classification case to be meaningful, the covering numbers N_1(ε/8, l_F, 2m) or N_1(ε/(8L), F, 2m) need to be sub-exponential in m for the above results to be meaningful. In the binary case, we saw that if the VC-dimension of H is finite, then the growth function of H grows polynomially in m. Analogous results can be shown to hold for the covering numbers. We provide some basic definitions and results here; further details can be found, for example, in [1].

Definition (Pseudo-dimension). Let F ⊆ ℝ^X and let x = (x_1, …, x_m) ∈ X^m. We say x is pseudo-shattered by F if ∃ r = (r_1, …, r_m) ∈ ℝ^m such that ∀ b = (b_1, …, b_m) ∈ {−1, +1}^m, ∃ f_b ∈ F such that sign(f_b(x_i) − r_i) = b_i ∀ i ∈ [m]. The pseudo-dimension of F is the cardinality of the largest set of points in X that can be pseudo-shattered by F:

    Pdim(F) = max{ m ∈ ℕ : ∃ x ∈ X^m such that x is pseudo-shattered by F }.

If F pseudo-shatters arbitrarily large sets of points in X, we say Pdim(F) = ∞.

Fact. If F is a vector space of real-valued functions, then Pdim(F) = dim(F). For example, for the class of all affine functions over ℝⁿ, given by F = { f : ℝⁿ → ℝ | f(x) = w·x + b for some w ∈ ℝⁿ, b ∈ ℝ }, we have Pdim(F) = n + 1. Clearly, if F ⊆ F′, then Pdim(F) ≤ Pdim(F′).

Definition (Fat-shattering dimension). Let F ⊆ ℝ^X and let x = (x_1, …, x_m) ∈ X^m. Let γ > 0. We say x is γ-shattered by F if ∃ r = (r_1, …, r_m) ∈ ℝ^m such that ∀ b = (b_1, …, b_m) ∈ {−1, +1}^m, ∃ f_b ∈ F such that b_i ( f_b(x_i) − r_i ) ≥ γ ∀ i ∈ [m]. The γ-dimension of F, or the fat-shattering dimension of F at scale γ, is the cardinality of the largest set of points in X that can be γ-shattered by F:

    fat_F(γ) = max{ m ∈ ℕ : ∃ x ∈ X^m such that x is γ-shattered by F }.

If F γ-shatters arbitrarily large sets of points in X, we say fat_F(γ) = ∞.

Clearly, fat_F(γ) ≤ Pdim(F) ∀ γ > 0. The fat-shattering dimension is often called a scale-sensitive dimension since it depends on the scale γ.
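To make the definition of pseudo-shattering concrete, here is a small brute-force Python check (an illustration added here, not part of the original notes). For the class of affine functions f(x) = wx + b on ℝ, it enumerates a grid of parameters and records which sign patterns sign(f(x_i) − r_i) are realized for a fixed witness r. With two points all four patterns appear, consistent with Pdim = n + 1 = 2 for affine functions on ℝ; the points, witness, and grid are arbitrary choices, and a grid search is evidence rather than a proof.

```python
import itertools
import numpy as np

def achievable_sign_patterns(xs, r, params):
    """Sign patterns (sign(f(x_i) - r_i))_i realizable by f(x) = w*x + b
    with (w, b) ranging over the given parameter grid."""
    patterns = set()
    for w, b in params:
        vals = w * xs + b - r
        if np.all(vals != 0):                      # skip ties with the witness
            patterns.add(tuple(np.sign(vals).astype(int)))
    return patterns

grid = np.linspace(-3.0, 3.0, 61)
params = list(itertools.product(grid, grid))

xs = np.array([0.0, 1.0])                          # two points to pseudo-shatter
r = np.array([0.0, 0.0])                           # witness values r_1, r_2
pats = achievable_sign_patterns(xs, r, params)
print(len(pats), "of", 2 ** len(xs), "sign patterns realized")   # expect 4 of 4
```

Showing that no three points can be pseudo-shattered, so that the pseudo-dimension is exactly 2, requires an argument over all witnesses r; the Fact above supplies this directly, since the affine functions on ℝ form a vector space of dimension 2.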

Both the pseudo-dimension and the fat-shattering dimension, when finite, can be used to bound the covering numbers of a function class F whose functions take values in a bounded range:

Theorem 4.1. Let F ⊆ [a, b]^X for some a ≤ b. Let 0 < ε ≤ b − a, and let fat_F(ε/8) = d < ∞. Then for m ≥ d,

    log N_1(ε, F, m) = O( d log²( m(b − a) / (dε) ) ).    (34)

Theorem 4.2. Let F ⊆ [a, b]^X for some a ≤ b, and let Pdim(F) = d < ∞. Then for all 0 < ε ≤ b − a and m ≥ d,

    N_1(ε, F, m) = O( ( m(b − a) / (dε) )^d ).    (35)

In particular, a finite pseudo-dimension guarantees covering numbers that grow only polynomially in m, and a finite fat-shattering dimension (at the relevant scale) guarantees covering numbers that are sub-exponential in m, which is what is needed for the bounds of Section 3 to be meaningful.

5 Next Lecture

In the next lecture, we will return to binary classification, but will focus on learning functions of the form h(x) = sign(f(x)) for some real-valued function f, and will see how the quantities considered in this lecture can be useful in such situations.

References

[1] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.