E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)


Covering Numbers, Pseudo-Dimension, and Fat-Shattering Dimension
Lecturer: Shivani Agarwal    Scribe: Shivani Agarwal

1 Introduction

So far we have seen how to obtain high-confidence bounds on the generalization error er^{0-1}_D[h_S] of a binary classifier h_S learned by an algorithm from a function class H ⊆ {−1, +1}^X of limited capacity, using the ideas of uniform convergence. We saw the use of the growth function Π_H(m) to measure the capacity of the class H, as well as the VC-dimension VCdim(H), which provides a one-number summary of the capacity of H.

In this lecture we will consider the problem of regression, or learning of real-valued functions. Here we are given a training sample S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m for some Y ⊆ ℝ, assumed now to be drawn from D^m for some distribution D on X × Y, and the goal is to learn from this sample a real-valued function f_S : X → ℝ that has low generalization error er^l_D[f_S] w.r.t. some appropriate loss function l : Y × ℝ → [0, ∞), such as the squared loss l_sq given by l_sq(y, ŷ) = (ŷ − y)². As before, we will be interested in obtaining high-confidence bounds on the generalization error of the learned function, er^l_D[f_S]. Again, we will consider learning algorithms that learn f_S from a function class F ⊆ ℝ^X of limited capacity, and obtain generalization error bounds for such algorithms via a uniform convergence result that upper bounds the probability

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε ),

where er^l_D[f] = E_{(x,y)∼D}[ l(y, f(x)) ] denotes the generalization error of f and er^l_S[f] = (1/m) Σ_{i=1}^m l(y_i, f(x_i)) its empirical error on S. However, in order to derive such a result, we will need a different notion of capacity for a class of real-valued functions F. In particular, we will use the covering numbers of F, which will play a role analogous to that played by the growth function in the case of binary-valued function classes.

2 Covering Numbers

We start by considering covering numbers of subsets of a general metric space. We will then specialize this to subsets of Euclidean space, and use this to define covering numbers for a real-valued function class.

2.1 Covering Numbers in a General Metric Space

Let (A, d) be a metric space.¹ Let W ⊆ A and let ε > 0. A set C ⊆ W is said to be a proper ε-cover of W w.r.t. d if for every w ∈ W, ∃ c ∈ C such that d(w, c) < ε.² In other words, C ⊆ W is an ε-cover of W w.r.t. d if the union of open d-balls of radius ε centered at points in C contains W:³

    W ⊆ ∪_{c∈C} B_d(ε, c).    (1)

If W has a finite ε-cover w.r.t. d, then we define the ε-covering number of W w.r.t. d to be the cardinality of the smallest ε-cover of W:

    N(ε, W, d) = min{ |C| : C is an ε-cover of W w.r.t. d }.    (2)

¹ Recall that a metric space (A, d) consists of a set A together with a metric d : A × A → [0, ∞) that satisfies the following for all x, y, z ∈ A: (1) d(x, y) = 0 ⟺ x = y; (2) d(x, y) = d(y, x); and (3) d(x, z) ≤ d(x, y) + d(y, z).
² Sometimes it is also convenient to consider improper covers C ⊆ A which need not be contained in W.
³ Recall that the open d-ball of radius ε centered at x ∈ A is defined as B_d(ε, x) = { y ∈ A : d(x, y) < ε }.
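As a concrete illustration of these definitions (an addition to these notes, not part of the original), the following short Python sketch greedily builds a proper ε-cover of a finite point set W ⊂ ℝ² under the Euclidean metric; the size of the cover it returns is an upper bound on N(ε, W, d). The point set and the radii are arbitrary choices made for the example.

```python
import numpy as np

def greedy_cover(W, eps, dist):
    """Greedily build a proper eps-cover C of the finite set W.

    Every w in W ends up within distance < eps of some c in C (possibly itself),
    so len(C) is an upper bound on the covering number N(eps, W, dist).
    """
    C = []
    for w in W:
        if not any(dist(w, c) < eps for c in C):
            C.append(w)
    return C

euclidean = lambda u, v: np.linalg.norm(u - v)

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(500, 2))   # 500 points in [-1, 1]^2
for eps in [0.5, 0.25, 0.1]:
    C = greedy_cover(W, eps, euclidean)
    print(f"eps = {eps:4}:  |C| = {len(C)}  (upper bound on N(eps, W, d))")
```

Since the greedy centers are pairwise at distance at least ε, the same set also shows that any (ε/2)-cover must contain at least |C| points, so N(ε/2, W, d) ≥ |C|; greedy covering thus brackets the covering number at neighbouring scales.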

If W does not have a finite ε-cover w.r.t. d, we take N(ε, W, d) = ∞. Thus the covering numbers of W can be viewed as measuring the extent of W in (A, d) at granularity or scale ε.

2.2 Covering Numbers in Euclidean Space

Consider now A = ℝⁿ. We can define a number of different metrics on ℝⁿ, including in particular the following:

    d_1(x, x′) = (1/n) Σ_{i=1}^n |x_i − x′_i|    (3)
    d_2(x, x′) = ( (1/n) Σ_{i=1}^n (x_i − x′_i)² )^{1/2}    (4)
    d_∞(x, x′) = max_{i} |x_i − x′_i| .    (5)

Accordingly, for any W ⊆ ℝⁿ, we can define the corresponding covering numbers N(ε, W, d_p) for p = 1, 2, ∞. It is easy to see that d_1(x, x′) ≤ d_2(x, x′) (by Jensen's inequality) and that d_2(x, x′) ≤ d_∞(x, x′), from which it follows that the corresponding covering numbers satisfy the relation

    N(ε, W, d_1) ≤ N(ε, W, d_2) ≤ N(ε, W, d_∞).    (6)

2.3 Uniform Covering Numbers for a Real-Valued Function Class

Now let F be a class of real-valued functions on X, and let x = (x_1, …, x_m) ∈ X^m. Then F|_x = { (f(x_1), …, f(x_m)) : f ∈ F } ⊆ ℝ^m. For any ε > 0 and m ∈ ℕ, the uniform d_p covering numbers of F (for p = 1, 2, ∞) are defined as

    N_p(ε, F, m) = max_{x ∈ X^m} N(ε, F|_x, d_p)    (7)

if N(ε, F|_x, d_p) is finite for all x ∈ X^m, and N_p(ε, F, m) = ∞ otherwise. This should be compared with the definition of the growth function for a class of binary-valued functions H, which also involved a maximum over x ∈ X^m: in that case, H|_x was finite, and the maximum was over the cardinality of H|_x; here, F|_x may in general be infinite, and the maximum is over the extent of F|_x in ℝ^m at scale ε, as measured using the metric d_p. In particular, for H ⊆ {−1, +1}^X, we have that for any ε ≤ 2, N(ε, H|_x, d_∞) = |H|_x|, and therefore N_∞(ε, H, m) = Π_H(m). Thus the uniform covering numbers can be viewed as generalizing the notion of growth function to classes of real-valued functions.

Note that the term 'uniform' here refers to the maximum over all x ∈ X^m of the covering numbers of F|_x in ℝ^m, and is unrelated to the use of the term 'uniform' in uniform convergence, which refers to the supremum over functions f ∈ F. Unless otherwise stated, in what follows we will refer to the uniform covering numbers of a function class F simply as the covering numbers of F.

3 Uniform Convergence in a Real-Valued Function Class F

We can assume that functions in F take values in some set Ŷ ⊆ ℝ, so that F ⊆ Ŷ^X (Y can be thought of as the space of true labels, and Ŷ as the space of predicted labels). We will require the loss function l to be bounded, i.e. we will assume ∃ B > 0 such that 0 ≤ l(y, ŷ) ≤ B ∀ y ∈ Y, ŷ ∈ Ŷ. We will find it useful to define, for any function class F ⊆ Ŷ^X and loss l : Y × Ŷ → [0, B], the loss function class l_F ⊆ [0, B]^{X×Y} given by

    l_F = { l_f : X × Y → [0, B]  |  l_f(x, y) = l(y, f(x)) for some f ∈ F }.    (8)

We will first prove a uniform convergence result for general losses l as above in terms of the d_1 covering numbers of the loss function class l_F, and will then show that for many losses l, including the squared loss when Y and Ŷ are bounded, the d_1 covering numbers of l_F can further be bounded in terms of the d_1 covering numbers of F.
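Before turning to the uniform convergence result, here is a small numerical illustration (an addition to these notes, not part of the original) of the quantities defined in Section 2: it restricts a simple one-parameter class of bounded functions to a sample x and greedily estimates N(ε, F|_x, d_p) for p = 1, 2, ∞. The class f_w(x) = clip(wx, −1, 1), the parameter grid, and the sample are arbitrary choices; the greedy sizes are only upper bounds on the true covering numbers, which satisfy the ordering (6), and the uniform covering number (7) would further take a maximum over all samples x.

```python
import numpy as np

# The normalized metrics of (3)-(5) on R^m.
def d1(u, v):   return np.mean(np.abs(u - v))
def d2(u, v):   return np.sqrt(np.mean((u - v) ** 2))
def dinf(u, v): return np.max(np.abs(u - v))

def greedy_cover_size(vectors, eps, dist):
    """Size of a greedily built proper eps-cover: an upper bound on N(eps, ., dist)."""
    centers = []
    for v in vectors:
        if not any(dist(v, c) < eps for c in centers):
            centers.append(v)
    return len(centers)

rng = np.random.default_rng(1)
m = 50
x = rng.uniform(-1.0, 1.0, size=m)                 # sample x = (x_1, ..., x_m)
ws = np.linspace(-5.0, 5.0, 401)                   # parameter grid for the class
F_x = [np.clip(w * x, -1.0, 1.0) for w in ws]      # restriction F|_x, as vectors in R^m

eps = 0.2
for name, dist in [("d_1", d1), ("d_2", d2), ("d_inf", dinf)]:
    print(f"greedy cover size under {name}: {greedy_cover_size(F_x, eps, dist)}")
```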

Theorem 3.1. Let Y, Ŷ ⊆ ℝ. Let F ⊆ Ŷ^X, and let l : Y × Ŷ → [0, B]. Let D be any distribution on X × Y. Then for any ε > 0,

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε )  ≤  2 N_1(ε/8, l_F, 2m) e^{−mε²/(32B²)}.    (9)

Proof. The proof uses similar techniques as in the proof of uniform convergence for the 0-1 loss in the binary case that we saw in Lecture 3, and has the same broad steps. The key difference is in the reduction to a finite class (Step 3).

Step 1: Symmetrization. Following the same steps as in Lecture 3, we can show that for mε² ≥ 8B²,

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε )  ≤  2 P_{(S,S̃)∼D^m×D^m}( sup_{f∈F} ( er^l_{S̃}[f] − er^l_S[f] ) ≥ ε/2 ).    (10)

Step 2: Swapping permutations. Again using the same argument as in Lecture 3, we can show that

    P_{(S,S̃)∼D^m×D^m}( sup_{f∈F} ( er^l_{S̃}[f] − er^l_S[f] ) ≥ ε/2 )  ≤  sup_{(S,S̃)∈(X×Y)^{2m}} [ P_{σ∼Γ_{2m}}( sup_{f∈F} ( er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f] ) ≥ ε/2 ) ],    (11)

where, as in Lecture 3, Γ_{2m} denotes the set of permutations of [2m] that for each i ∈ [m] either swap i and m+i or leave both fixed, and σ is drawn uniformly from Γ_{2m}.

Step 3: Reduction to a finite class. Fix any (S, S̃) ∈ (X × Y)^m × (X × Y)^m, and for simplicity define (x_{m+i}, y_{m+i}) = (x̃_i, ỹ_i) ∀ i ∈ [m], so that er^l_{σ(S)}[f] = (1/m) Σ_{i=1}^m l_f(x_{σ(i)}, y_{σ(i)}) and er^l_{σ(S̃)}[f] = (1/m) Σ_{i=m+1}^{2m} l_f(x_{σ(i)}, y_{σ(i)}). Now consider l_F|_{(S,S̃)} ⊆ [0, B]^{2m}. Let G ⊆ F be such that l_G|_{(S,S̃)} is an (ε/8)-cover of l_F|_{(S,S̃)} w.r.t. d_1. Clearly, we can take |G| = N(ε/8, l_F|_{(S,S̃)}, d_1) ≤ N_1(ε/8, l_F, 2m); since l_F maps into a bounded interval, this is a finite number. Now consider any σ ∈ Γ_{2m}. We claim that if ∃ f ∈ F such that

    er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f] ≥ ε/2,    (12)

then ∃ g ∈ G such that

    er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ≥ ε/4.    (13)

To see this, let f ∈ F be such that (12) holds. Take any g ∈ G for which

    (1/(2m)) Σ_{i=1}^{2m} | l_f(x_i, y_i) − l_g(x_i, y_i) | < ε/8.    (14)

Such a g exists since l_G|_{(S,S̃)} is an (ε/8)-cover of l_F|_{(S,S̃)} w.r.t. d_1. We will show that g satisfies (13). In particular, we have

    er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f]
      = ( er^l_{σ(S̃)}[f] − er^l_{σ(S̃)}[g] ) + ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + ( er^l_{σ(S)}[g] − er^l_{σ(S)}[f] )    (15)
      ≤ | er^l_{σ(S̃)}[f] − er^l_{σ(S̃)}[g] | + ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + | er^l_{σ(S)}[f] − er^l_{σ(S)}[g] |    (16)
      ≤ ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + (1/m) Σ_{i=m+1}^{2m} | l_f(x_{σ(i)}, y_{σ(i)}) − l_g(x_{σ(i)}, y_{σ(i)}) | + (1/m) Σ_{i=1}^{m} | l_f(x_{σ(i)}, y_{σ(i)}) − l_g(x_{σ(i)}, y_{σ(i)}) |    (17)
      = ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + (1/m) Σ_{i=1}^{2m} | l_f(x_{σ(i)}, y_{σ(i)}) − l_g(x_{σ(i)}, y_{σ(i)}) |    (18)
      = ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + (1/m) Σ_{i=1}^{2m} | l_f(x_i, y_i) − l_g(x_i, y_i) |    (19)
      < ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) + ε/4    (20)

by (14): since σ is a permutation of [2m], the two sums in (17) combine and reindex to give (19), and by (14) the sum in (19) is less than (1/m)(2m)(ε/8) = ε/4. Since the left-hand side is at least ε/2 by (12), it follows that er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] > ε/2 − ε/4 = ε/4.

The claim follows. Thus we have

    P_{σ∼Γ_{2m}}( sup_{f∈F} ( er^l_{σ(S̃)}[f] − er^l_{σ(S)}[f] ) ≥ ε/2 )
      ≤ P_{σ∼Γ_{2m}}( max_{g∈G} ( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ) ≥ ε/4 )    (21)
      ≤ N_1(ε/8, l_F, 2m) · max_{g∈G} P_{σ∼Γ_{2m}}( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ≥ ε/4 )    (22)

by the union bound.

Step 4: Hoeffding's inequality. As in Lecture 3, Hoeffding's inequality can now be used to show that for any g ∈ G,

    P_{σ∼Γ_{2m}}( er^l_{σ(S̃)}[g] − er^l_{σ(S)}[g] ≥ ε/4 )    (23)
      = P_r( (1/m) Σ_{i=1}^m r_i ( l_g(x̃_i, ỹ_i) − l_g(x_i, y_i) ) ≥ ε/4 )    (24)
      ≤ e^{−mε²/(32B²)},    (25)

where r = (r_1, …, r_m) are independent random variables, each uniform over {−1, +1}. Putting everything together yields the desired result for mε² ≥ 8B²; for mε² < 8B², the result holds trivially. ∎

The above result yields a high-confidence bound on the generalization error of a function learned from F in terms of covering numbers of l_F. For well-behaved loss functions l, these can be further bounded in terms of covering numbers of F:

Lemma 3.2. Let Y, Ŷ ⊆ ℝ. Let F ⊆ Ŷ^X, and let l : Y × Ŷ → [0, B]. If l is Lipschitz in its second argument with Lipschitz constant L > 0, i.e.

    | l(y, ŷ) − l(y, ŷ′) | ≤ L | ŷ − ŷ′ |    ∀ y ∈ Y, ŷ, ŷ′ ∈ Ŷ,

then for any ε > 0 and m ∈ ℕ,

    N_1(ε, l_F, m) ≤ N_1(ε/L, F, m).

Proof. Let S = ((x_1, y_1), …, (x_m, y_m)) ∈ (X × Y)^m, and let f, g ∈ F. Then

    (1/m) Σ_{i=1}^m | l_f(x_i, y_i) − l_g(x_i, y_i) | = (1/m) Σ_{i=1}^m | l(y_i, f(x_i)) − l(y_i, g(x_i)) |    (26)
      ≤ L · (1/m) Σ_{i=1}^m | f(x_i) − g(x_i) | .    (27)

Thus any (ε/L)-cover of F|_x w.r.t. d_1 yields, after applying the loss, an ε-cover of l_F|_S w.r.t. d_1, which implies the result. ∎

This yields the following corollary:

Corollary 3.3. Let Y, Ŷ ⊆ ℝ, F ⊆ Ŷ^X, and l : Y × Ŷ → [0, B] such that l is Lipschitz in its second argument with Lipschitz constant L > 0. Let D be any distribution on X × Y. Then for any ε > 0:

    P_{S∼D^m}( sup_{f∈F} ( er^l_D[f] − er^l_S[f] ) ≥ ε )  ≤  2 N_1(ε/(8L), F, 2m) e^{−mε²/(32B²)}.    (28)
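To see how a bound of the form (28) is used numerically, here is a short Python sketch (an illustrative addition, not from the original notes) that evaluates the logarithm of the right-hand side of (28) and doubles m until the bound drops below a target failure probability δ. The covering-number bound plugged in, ln N_1(α, F, k) ≤ d ln(Ck/α), is a hypothetical stand-in of the polynomial form that Section 4 justifies for classes of finite pseudo-dimension; d, C, and the other numbers are assumptions made for the example.

```python
import math

def log_bound(m, eps, B, L, log_N1):
    """Natural log of the right-hand side of (28):
    2 * N_1(eps/(8L), F, 2m) * exp(-m * eps^2 / (32 * B^2)).
    `log_N1(alpha, k)` should return an upper bound on ln N_1(alpha, F, k)."""
    return math.log(2.0) + log_N1(eps / (8 * L), 2 * m) - m * eps ** 2 / (32 * B ** 2)

# Hypothetical covering-number bound ln N_1(alpha, F, k) <= d * ln(C * k / alpha),
# of the polynomial form discussed in Section 4; d and C are made-up values.
d, C = 10, 2.0
log_N1 = lambda alpha, k: d * math.log(C * k / alpha)

B, L = 4.0, 4.0            # squared loss on [-1, 1]: bounded by 4, Lipschitz constant 4
eps, delta = 0.1, 0.01

m = 1
while log_bound(m, eps, B, L, log_N1) > math.log(delta):
    m *= 2
print("m =", m, "samples suffice for the bound to reach delta =", delta)
```

Because the exponential term eventually dominates the polynomial covering-number term, the doubling loop always terminates; the printed m is a sufficient (not necessary) sample size under the assumed covering-number growth.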

As an example, consider the squared loss l_sq(y, ŷ) = (ŷ − y)². It can be shown that if Y, Ŷ are bounded, then l_sq is bounded and Lipschitz. In particular, for Y = Ŷ = [−1, 1], we have 0 ≤ l_sq(y, ŷ) ≤ 4 ∀ y ∈ Y, ŷ ∈ Ŷ, and l_sq is Lipschitz in its second argument with Lipschitz constant L = 4:

    | l_sq(y, ŷ) − l_sq(y, ŷ′) | = | (y − ŷ)² − (y − ŷ′)² |    (29)
      = | ŷ² − ŷ′² − 2y(ŷ − ŷ′) |    (30)
      ≤ | ŷ + ŷ′ | | ŷ − ŷ′ | + 2 | y | | ŷ − ŷ′ |    (31)
      ≤ 4 | ŷ − ŷ′ |    (32)

since y, ŷ, ŷ′ ∈ [−1, 1]. Thus when both labels and predictions are in [−1, 1], one gets the following for the squared loss:

Corollary 3.4. Let Y = Ŷ = [−1, 1], F ⊆ Ŷ^X, and l_sq : Y × Ŷ → [0, 4] be given by l_sq(y, ŷ) = (ŷ − y)². Let D be any distribution on X × Y. Then for any ε > 0:

    P_{S∼D^m}( sup_{f∈F} ( er^sq_D[f] − er^sq_S[f] ) ≥ ε )  ≤  2 N_1(ε/32, F, 2m) e^{−mε²/512}.    (33)

Exercise. Show that for Y = Ŷ = [−1, 1], the absolute loss given by l_abs(y, ŷ) = |ŷ − y| ∀ y ∈ Y, ŷ ∈ Ŷ is bounded and is Lipschitz in its second argument with Lipschitz constant L = 1.

4 Pseudo-Dimension and Fat-Shattering Dimension

Just as the growth function Π_H(m) needed to be sub-exponential in m for the uniform convergence result in the binary classification case to be meaningful, the covering numbers N_1(ε/8, l_F, 2m) or N_1(ε/(8L), F, 2m) need to be sub-exponential in m for the above results to be meaningful. In the binary case, we saw that if the VC-dimension of H is finite, then the growth function of H grows polynomially in m. Analogous results can be shown to hold for the covering numbers. We provide some basic definitions and results here; further details can be found, for example, in [1].

Definition (Pseudo-dimension). Let F ⊆ ℝ^X and let x = (x_1, …, x_m) ∈ X^m. We say x is pseudo-shattered by F if ∃ r = (r_1, …, r_m) ∈ ℝ^m such that ∀ b = (b_1, …, b_m) ∈ {−1, +1}^m, ∃ f_b ∈ F such that sign(f_b(x_i) − r_i) = b_i ∀ i ∈ [m]. The pseudo-dimension of F is the cardinality of the largest set of points in X that can be pseudo-shattered by F:

    Pdim(F) = max{ m ∈ ℕ : ∃ x ∈ X^m such that x is pseudo-shattered by F }.

If F pseudo-shatters arbitrarily large sets of points in X, we say Pdim(F) = ∞.

Fact. If F is a vector space of real-valued functions, then Pdim(F) = dim(F). For example, for the class of all affine functions over ℝⁿ, given by F = { f : ℝⁿ → ℝ | f(x) = w·x + b for some w ∈ ℝⁿ, b ∈ ℝ }, we have Pdim(F) = n + 1. Clearly, if F ⊆ F′, then Pdim(F) ≤ Pdim(F′).

Definition (Fat-shattering dimension). Let F ⊆ ℝ^X and let x = (x_1, …, x_m) ∈ X^m. Let γ > 0. We say x is γ-shattered by F if ∃ r = (r_1, …, r_m) ∈ ℝ^m such that ∀ b = (b_1, …, b_m) ∈ {−1, +1}^m, ∃ f_b ∈ F such that b_i ( f_b(x_i) − r_i ) ≥ γ ∀ i ∈ [m]. The γ-dimension of F, or the fat-shattering dimension of F at scale γ, is the cardinality of the largest set of points in X that can be γ-shattered by F:

    fat_F(γ) = max{ m ∈ ℕ : ∃ x ∈ X^m such that x is γ-shattered by F }.

If F γ-shatters arbitrarily large sets of points in X, we say fat_F(γ) = ∞.

Clearly, fat_F(γ) ≤ Pdim(F) ∀ γ > 0. The fat-shattering dimension is often called a scale-sensitive dimension since it depends on the scale γ.
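To make the definition of pseudo-shattering concrete, here is a small brute-force Python check (an illustration added here, not part of the original notes). For the class of affine functions f(x) = wx + b on ℝ, it enumerates a grid of parameters and records which sign patterns sign(f(x_i) − r_i) are realized for a fixed witness r. With two points all four patterns appear, consistent with Pdim = n + 1 = 2 for affine functions on ℝ; the points, witness, and grid are arbitrary choices, and a grid search is evidence rather than a proof.

```python
import itertools
import numpy as np

def achievable_sign_patterns(xs, r, params):
    """Sign patterns (sign(f(x_i) - r_i))_i realizable by f(x) = w*x + b
    with (w, b) ranging over the given parameter grid."""
    patterns = set()
    for w, b in params:
        vals = w * xs + b - r
        if np.all(vals != 0):                      # skip ties with the witness
            patterns.add(tuple(np.sign(vals).astype(int)))
    return patterns

grid = np.linspace(-3.0, 3.0, 61)
params = list(itertools.product(grid, grid))

xs = np.array([0.0, 1.0])                          # two points to pseudo-shatter
r = np.array([0.0, 0.0])                           # witness values r_1, r_2
pats = achievable_sign_patterns(xs, r, params)
print(len(pats), "of", 2 ** len(xs), "sign patterns realized")   # expect 4 of 4
```

Showing that no three points can be pseudo-shattered, so that the pseudo-dimension is exactly 2, requires an argument over all witnesses r; the Fact above supplies this directly, since the affine functions on ℝ form a vector space of dimension 2.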

Both the pseudo-dimension and the fat-shattering dimension, when finite, can be used to bound the covering numbers of a function class F whose functions take values in a bounded range:

Theorem 4.1. Let F ⊆ [a, b]^X for some a ≤ b. Let 0 < ε ≤ b − a, and let fat_F(ε/8) = d < ∞. Then for m ≥ d,

    log N_1(ε, F, m) = O( d log²( m(b − a) / (dε) ) ).    (34)

Theorem 4.2. Let F ⊆ [a, b]^X for some a ≤ b, and let Pdim(F) = d < ∞. Then for all 0 < ε ≤ b − a and m ≥ d,

    N_1(ε, F, m) = O( ( m(b − a) / (dε) )^d ).    (35)

In particular, a finite pseudo-dimension guarantees covering numbers that grow only polynomially in m, and a finite fat-shattering dimension (at the relevant scale) guarantees covering numbers that are sub-exponential in m, which is what is needed for the bounds of Section 3 to be meaningful.

5 Next Lecture

In the next lecture, we will return to binary classification, but will focus on learning functions of the form h(x) = sign(f(x)) for some real-valued function f, and will see how the quantities considered in this lecture can be useful in such situations.

References

[1] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.