Vapnik-Chervonenkis theory


Risi Kondor
June 13, 2008

For the purposes of this lecture, we restrict ourselves to the binary supervised batch learning setting. We assume that we have an input space X and an unknown distribution D on X × {−1,+1}. Given a training set S = ((x_1,y_1),(x_2,y_2),...,(x_m,y_m)) drawn from D, our learning algorithm tries to find a hypothesis ĥ: X → {−1,+1} that will predict well on future (x,y) examples drawn from D, in the sense that the true error

    E(ĥ) = E_{(x,y)∼D}[ I[ĥ(x) ≠ y] ]    (1)

will not be too big. Of course we cannot measure E(ĥ), because we don't know D. What we have instead is the empirical error measured on the sample S:

    E_S(ĥ) = (1/m) Σ_{i=1}^{m} I[ĥ(x_i) ≠ y_i].    (2)

This is otherwise known as the training error, and typically it is an over-optimistic estimate of the true error, simply because the way most learning algorithms work is to implicitly or explicitly drive down just this quantity. Some amount of overfitting is then unavoidable.

The job of generalization bounds is to relate these two quantities, so that we can report things like "my algorithm found the hypothesis ĥ, on the training data it has error E_S, and on future data the error is not likely to be more than ε worse". In other words, E(ĥ) − E_S(ĥ) < ε. No matter how we set ε, however, a really, really bad training set (in the sense of a very unrepresentative sample) can always mislead us even more, so explicit bounds of this form are generally not obtainable. The "P" part of PAC stands for aiming for guarantees of this form only in the probabilistic sense, i.e., finding (ε,δ) pairs, where both are small positive real numbers, such that

    P_S[ E(ĥ) − E_S(ĥ) ≤ ε ] ≥ 1 − δ.    (3)

Here P_S stands for probability over the choice of training set.
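
To make definitions (1) and (2) concrete, here is a minimal Python sketch under an assumed toy distribution and threshold hypothesis (both invented for illustration; they are not from the notes): it draws a sample from D, computes the empirical error E_S(ĥ), and compares it with the true error E(ĥ), which is known exactly for this synthetic D.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m labeled examples from a toy distribution D on [0,1] x {-1,+1}:
    x ~ Uniform(0,1), y = sign(x - 0.5), with each label flipped w.p. 0.1."""
    x = rng.uniform(0, 1, size=m)
    y = np.where(x > 0.5, 1, -1)
    flip = rng.random(m) < 0.1
    y[flip] *= -1
    return x, y

def h(x, theta=0.5):
    """A fixed threshold hypothesis h(x) = sign(x - theta)."""
    return np.where(x > theta, 1, -1)

# Empirical error (2) on a training sample of size m = 200.
x_tr, y_tr = sample(200)
emp_err = np.mean(h(x_tr) != y_tr)

# For this h and this synthetic D, the true error (1) is exactly the 10% label noise.
print(f"empirical error E_S(h) = {emp_err:.3f},  true error E(h) = 0.100")
```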

Concentration

The fundamental idea behind all generalization bounds is that although it is possible that the same quantity, in our case the error, will turn out to be very different on two different samples from the same distribution, this is not likely to happen. In general, as the sample size grows, empirical quantities tend to concentrate more and more around their mean, in our case the true error. The simplest inequality capturing this fact, and one that is key to our development, is Hoeffding's inequality, which states that if Z_1, Z_2, ..., Z_m are independent draws of a Bernoulli random variable with parameter p and γ > 0, then

    P[ (1/m) Σ_{i=1}^{m} Z_i > p + γ ] ≤ e^{−2mγ²}.

This fits our problem nicely, because for any h ∈ C, if we take Z_i = I[h(x_i) ≠ y_i], Hoeffding's inequality says something about the probability of deviations from the true error p = E(h) of the hypothesis h. Plugging in the hypothesis ĥ returned by the algorithm and blindly applying Hoeffding's inequality gives

    P[ E(ĥ) − E_S(ĥ) > ε ] ≤ e^{−2mε²},

so setting the right hand side equal to δ, the PAC bound

    P[ E(ĥ) − E_S(ĥ) > ε ] < δ    (4)

is satisfied when

    ε > √( ln(1/δ) / (2m) ).

Unfortunately, this simple argument is NOT CORRECT. What goes wrong is that, given the fact that our algorithm returned ĥ, the (x_i, y_i) examples in the sample are not IID samples from D, and consequently neither are the Z_i IID samples from Bernoulli(E(ĥ)). One way to put this coupling between the sample and the hypothesis in relief is to note that our algorithm is really just a function A: S ↦ ĥ. This makes it clear that the empirical error is a function of S in two ways: through the hypothesis A(S) and through the individual training examples (x, y) ∈ S. Clearly, we can't hold one constant and regard E_S as a statistic based on IID draws of the other.
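
As a quick sanity check of Hoeffding's inequality (not part of the original notes), the following sketch estimates the left-hand side by Monte Carlo for an assumed p, m and γ, and compares it with the bound e^{−2mγ²}:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, gamma = 100, 0.3, 0.1
trials = 100_000

# Monte Carlo estimate of P[(1/m) sum_i Z_i > p + gamma] for Z_i ~ Bernoulli(p).
Z = rng.random((trials, m)) < p
tail = np.mean(Z.mean(axis=1) > p + gamma)

bound = np.exp(-2 * m * gamma**2)   # Hoeffding: e^{-2 m gamma^2}
print(f"empirical tail probability: {tail:.4f}   Hoeffding bound: {bound:.4f}")
```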

Uniform convergence

Overcoming the problem of the coupling between S and ĥ is a major hurdle in proving generalization bounds. The way to proceed is, instead of focusing on any one particular h, to focus on all of them simultaneously. In particular, if we can find an (ε,δ) pair such that

    P[ E(h) − E_S(h) ≤ ε  for all h ∈ C ] ≥ 1 − δ,

or equivalently,

    P[ ∃ h ∈ C such that E(h) − E_S(h) > ε ] < δ,

then that (ε,δ) pair will certainly satisfy the PAC bound (3). At first sight this seems like terrible overkill, since C might include some crazy, irregular functions that might never be chosen by any reasonable algorithm, but these functions might make our bound very loose. On the other hand, it is worth noting that, at least amongst reasonable functions, the ĥ chosen by our learning algorithm is actually likely to be towards the top of the list in terms of the magnitude of E(h) − E_S(h), simply because learning algorithms by their very nature tend to drive down E_S(h). So bounding E(h) − E_S(h) for all h ∈ C might not be such a crazy thing to do after all.

How best to do it is not obvious, though. If C has only a finite number of hypotheses, the simplistic method is to use the union bound:

    P[ ∃ h ∈ C such that E(h) − E_S(h) > ε ] ≤ Σ_{h∈C} P[ E(h) − E_S(h) > ε ],

where on the right hand side we are now allowed to use the Hoeffding bound

    P[ E(h) − E_S(h) > ε ] ≤ e^{−2mε²},

leading to |C| e^{−2mε²} ≤ δ and therefore

    ε > √( (ln|C| + ln(1/δ)) / (2m) ).

The union bound is clearly very loose, though, and the explicit appearance of the number of hypotheses in C is also worrying: what if two hypotheses are almost identical? Should we still count them as separate? Clearly, there must be some more appropriate way of quantifying the richness of a concept space than just counting the number of hypotheses in C. VC theory is an attempt to do just this.
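
For a finite class, the resulting guarantee is easy to evaluate numerically. The helper below (a sketch, not from the notes) computes ε = √((ln|C| + ln(1/δ))/(2m)) and shows that the dependence on |C| is only logarithmic:

```python
import math

def union_bound_epsilon(num_hypotheses, m, delta):
    """epsilon such that, with prob. >= 1 - delta, E(h) - E_S(h) <= epsilon
    simultaneously for all h in a finite class C (Hoeffding + union bound)."""
    return math.sqrt((math.log(num_hypotheses) + math.log(1 / delta)) / (2 * m))

# The guarantee degrades only logarithmically in |C|.
for size in (10, 1_000, 1_000_000):
    eps = union_bound_epsilon(size, m=1000, delta=0.05)
    print(f"|C| = {size:>9d}   epsilon = {eps:.4f}")
```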

Symmetrization

Concentration inequalities don't just tell us that on a single sample S, E_S(h) can't deviate too much from its mean E(h); they also imply that for a pair of independent samples S_1 and S_2, E_{S_1}(h) can't be very far from E_{S_2}(h). The key idea behind Vapnik and Chervonenkis's pioneering work was to exploit this fact, by a process called symmetrization, to reduce everything to just looking at finite samples.

We start with the following simple application of Hoeffding's inequality.

Proposition 1. Let S and S' be two independent samples of size m drawn from a distribution D on X × {−1,+1}, and let E and E_S be defined as in (1) and (2). Then for any h ∈ C,

    P[ E(h) − E_S(h) > ε ] ≤ 2 P[ E_{S'}(h) − E_S(h) > ε/2 ].

This result also readily generalizes to the uniform case.

Proposition 2. Let S and S' be two independent samples of size m drawn from a distribution D on X × {−1,+1}, and let E and E_S be defined as in (1) and (2). Then

    P[ sup_{h∈C} ( E(h) − E_S(h) ) > ε ] ≤ 2 P[ sup_{h∈C} ( E_{S'}(h) − E_S(h) ) > ε/2 ].

Now let us define S̄ = S ∪ S' and ask ourselves: what is the probability that the errors incurred by h on S̄ are distributed in such a way that mε/2 more of them fall in S' than in S? For the sake of simplicity, here we only consider the case that exactly 0 errors fall in S and k = mε/2 fall in S'.

Proposition 3. Consider 2m balls, of which exactly k are black. If we randomly split the balls into two sets of size m, then the probability P_{k,0} that all the black balls end up in the first set is at most 1/2^k.

Proof.

    P_{k,0} = \binom{m}{k} / \binom{2m}{k} = [ m(m−1)(m−2)···(m−k+1) ] / [ (2m)(2m−1)···(2m−k+1) ] ≤ 1/2^k.

For the general case of u and u + k balls in the two sets, a similar combinatorial inequality holds.

The real significance of symmetrization is that it allows us to quantify the complexity of C in terms of just the joint sample S̄ instead of its behavior on the entire input space. In particular, in bounding sup_h [ E_{S'}(h) − E_S(h) ], two hypotheses h and h' only need to be counted as distinct if they differ on S̄: how they behave over the rest of X is immaterial. To be somewhat more explicit, we define the restriction of h to S̄ as

    h_{S̄}: S̄ → {−1,+1},    h_{S̄}(x) = h(x),
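
Proposition 3 is easy to check numerically. A small sketch (not from the notes) computes P_{k,0} = \binom{m}{k}/\binom{2m}{k} exactly and compares it with 2^{−k}:

```python
from math import comb

def p_k0(m, k):
    """Probability that all k black balls land in the first half when 2m balls
    are split uniformly at random into two sets of size m (Proposition 3)."""
    return comb(m, k) / comb(2 * m, k)

for k in (1, 3, 5, 10):
    print(f"k = {k:>2d}   P_k0 = {p_k0(50, k):.6f}   2^-k = {2.0 ** -k:.6f}")
```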

and the corresponding restricted concept class as C_{S̄} = { h_{S̄} : h ∈ C }. While C_{S̄} is of course a property of S̄, it is also a characteristic of the entire concept class C, in the sense that it is often possible to bound its size independently of S̄. The maximal rate at which |C_{S̄}| can grow with the sample size,

    Π_C(n) = max_{U ⊆ X, |U| = n} |C_U|,

is called the growth function. Using the growth function and Proposition 3, we can now give the finite sample version of the union bound:

    P[ sup_{h∈C} ( E_{S'}(h) − E_S(h) ) > ε/2 ] ≤ Π_C(2m) · 2^{−mε/2}.

The VC dimension is solely a device for computing Π_C(n).

The VC dimension

The concept class C is said to shatter a set S ⊆ X if C can realize all possible labelings of S, i.e., if |C_S| = 2^{|S|}. The Vapnik-Chervonenkis dimension of C is the size of the largest subset of X that C can shatter,

    VC(C) = max { |S| : S ⊆ X and |C_S| = 2^{|S|} }.

The following famous result (called the Sauer-Shelah lemma) tells us how to bound the growth function in terms of the VC dimension.

Proposition 4 (Sauer-Shelah lemma). Let C be a concept class of VC dimension d, and let Π(m) be the corresponding growth function. Then for m ≤ d, Π(m) = 2^m, and for m > d,

    Π(m) ≤ (em/d)^d.    (5)

Proof. The m ≤ d case is just a restatement of the definition of VC dimension. For m > d we first prove, by induction on m, the intermediate bound

    Π(m) ≤ Σ_{i=0}^{d} \binom{m}{i},

and then bound this sum by (em/d)^d. For m = 2 it is trivial to check that the intermediate bound holds. Assuming that it holds for m and any d, we now show that it also holds for m+1 and any d. Let S be any subset of X of size m+1, and, fixing some x ∈ S, let us write it as S = S\x ∪ {x}, where |S\x| = m. Now for any h ∈ C_{S\x}, consider its two possible extensions

    h_+: S → {−1,+1}  with  h_+(x') = h(x') for x' ∈ S\x  and  h_+(x) = +1,
    h_−: S → {−1,+1}  with  h_−(x') = h(x') for x' ∈ S\x  and  h_−(x) = −1.

Either both of these hypotheses are in C_S or only one of them is. Let U be the subset of C_{S\x} consisting of hypotheses for which both h_+ and h_− are in C_S, and let U_S = ∪_{h∈U} {h_+, h_−}. Then we have

    |C_S| = |C_{S\x}| + |U|.
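
The definitions of shattering and VC dimension can be checked by brute force for very small classes. Here is a sketch (an invented example, not from the notes) for the class of one-sided thresholds h_t(x) = +1 iff x ≥ t on the real line, whose VC dimension is 1: it enumerates C_S on small point sets and tests whether |C_S| = 2^{|S|}.

```python
from itertools import combinations
import numpy as np

# Toy concept class on the real line: thresholds h_t(x) = +1 iff x >= t.
# Its VC dimension is 1; we verify this by brute-force shattering checks.
thresholds = np.linspace(0.0, 1.0, 201)

def restriction(points):
    """C_S: the set of distinct labelings the threshold class realizes on S."""
    return {tuple(np.where(points >= t, 1, -1)) for t in thresholds}

points = np.array([0.2, 0.5, 0.8])
for r in range(1, len(points) + 1):
    shattered = any(len(restriction(np.array(subset))) == 2 ** r
                    for subset in combinations(points, r))
    print(f"some subset of size {r} is shattered: {shattered}")
```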

By the inductive hypothesis, |C_{S\x}| ≤ Σ_{i=0}^{d} \binom{m}{i}. As for the second term, consider that if U shatters any set V, then U_S will shatter V ∪ {x}, so VC(U) ≤ VC(U_S) − 1 ≤ d − 1, and therefore by the inductive hypothesis |U| ≤ Σ_{i=0}^{d−1} \binom{m}{i}. Therefore

    |C_S| ≤ Σ_{i=0}^{d} \binom{m}{i} + Σ_{i=0}^{d−1} \binom{m}{i} = Σ_{i=0}^{d} \binom{m+1}{i},

since Σ_{i=0}^{d} \binom{m+1}{i} is just the number of ways of choosing up to d objects from m+1, and the sum on the left corresponds to decomposing these choices according to whether a particular object has been chosen or not. Finally, putting it all together, for m > d,

    Σ_{i=0}^{d} \binom{m}{i} ≤ (m/d)^d Σ_{i=0}^{d} \binom{m}{i} (d/m)^i ≤ (m/d)^d Σ_{i=0}^{m} \binom{m}{i} (d/m)^i = (m/d)^d (1 + d/m)^m ≤ (m/d)^d e^d = (em/d)^d,

which proves (5).
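
A quick numerical comparison of the two bounds appearing in the proof, using assumed helper functions (a sketch, not from the notes):

```python
from math import comb, e

def binom_sum(m, d):
    """Sum_{i=0}^{d} binom(m, i): the bound established by the induction."""
    return sum(comb(m, i) for i in range(d + 1))

def sauer_bound(m, d):
    """The closed-form bound (e*m/d)^d from (5), valid for m > d."""
    return (e * m / d) ** d

for m, d in [(20, 3), (100, 5), (1000, 10)]:
    print(f"m = {m:>4d}, d = {d:>2d}:  sum = {binom_sum(m, d)}   (em/d)^d ≈ {sauer_bound(m, d):.3e}")
```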