MACHINE LEARNING - CS671 - Part 2a The Vapnik-Chervonenkis Dimension


Prof. Dan A. Simovici, UMB

The Vapnik-Chervonenkis dimension of a collection of sets was introduced in [3] and independently in [2]. Its main interest for machine learning stems from one of the basic models of the field, the probably approximately correct (PAC) learning paradigm, as was shown in [1].

The Trace of a Collection of Sets on a Set

Let C ⊆ P(U). The trace of C on a set K is the collection of sets C_K = {K ∩ C : C ∈ C}. If C_K equals P(K), then we say that K is shattered by C. This means that there are concepts in C that split K in all 2^|K| possible ways.
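For finite collections these definitions can be checked mechanically. The following is a minimal Python sketch (the helper names trace, powerset, and shatters are ours, not part of the lecture) that computes C_K and compares it with P(K):

```python
from itertools import chain, combinations

def trace(collection, K):
    """C_K: the collection of sets K ∩ C for C in the collection."""
    K = frozenset(K)
    return {K & frozenset(C) for C in collection}

def powerset(K):
    """P(K): all subsets of K, as frozensets."""
    items = list(K)
    return {frozenset(s) for s in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

def shatters(collection, K):
    """True iff the trace of the collection on K equals P(K)."""
    return trace(collection, K) == powerset(K)
```

For instance, shatters([{1}, {2}, {1, 2}, set()], {1, 2}) returns True, since all four subsets of {1, 2} occur as intersections.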

Main Definition

The Vapnik-Chervonenkis dimension of the collection C (called the VC-dimension for brevity) is the largest size of a set K that is shattered by C; it is denoted by VCD(C).

Example: The VC-dimension of the collection of intervals in R is 2.

Remarks

If VCD(C) = d, then there exists a set K of size d such that for each subset L of K there exists a set C ∈ C such that L = K ∩ C. Since there exist 2^d subsets of K, there are at least 2^d sets in C, so 2^d ≤ |C|. Thus, VCD(C) ≤ log_2 |C|.

Consequently, if C is finite, then VCD(C) is finite. The converse is false: there exist infinite collections C that have a finite VC-dimension.

The tabular form of C_K

Let U = {u_1, ..., u_n}, and let θ = (T_C, u_1 u_2 ··· u_n, r) be a table, where r = (t_1, ..., t_p). The domain of each of the attributes u_i is the set {0, 1}. Each tuple t_k corresponds to a set C_k of C and is defined by

t_k[u_i] = 1 if u_i ∈ C_k, and t_k[u_i] = 0 otherwise,

for 1 ≤ i ≤ n. Then C shatters K if and only if the projection r[K] contains 2^|K| distinct rows.

Example

Let U = {u_1, u_2, u_3, u_4} and let C be the collection of subsets of U given by C = {{u_2,u_3}, {u_1,u_3,u_4}, {u_2,u_4}, {u_1,u_2}, {u_2,u_3,u_4}}.

T_C | u_1  u_2  u_3  u_4
----+--------------------
t_1 |  0    1    1    0
t_2 |  1    0    1    1
t_3 |  0    1    0    1
t_4 |  1    1    0    0
t_5 |  0    1    1    1

K = {u_1, u_3} is shattered by C because r[K] = ((0,1), (1,1), (0,0), (1,0), (0,1)) contains all four necessary tuples (0,0), (0,1), (1,0), and (1,1). On the other hand, no subset K of U that contains at least three elements can be shattered by C, because this would require r[K] to contain at least eight distinct tuples, while T_C has only five. Thus, VCD(C) = 2.
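The shattering test above can also be run mechanically on the bit-vector form of T_C; a small sketch (the encoding and names are ours):

```python
from itertools import combinations

# Rows of T_C: one 0/1 tuple per set of C, columns u_1, ..., u_4.
rows = [(0, 1, 1, 0), (1, 0, 1, 1), (0, 1, 0, 1), (1, 1, 0, 0), (0, 1, 1, 1)]

def shattered(cols):
    """True iff the projection r[K] on the given column indices contains
    all 2^|K| distinct 0/1 patterns."""
    projected = {tuple(row[i] for i in cols) for row in rows}
    return len(projected) == 2 ** len(cols)

assert shattered((0, 2))  # K = {u_1, u_3}

# VCD(C): the largest k such that some k-column projection is shattered.
n = len(rows[0])
print(max(k for k in range(n + 1)
          if any(shattered(c) for c in combinations(range(n), k))))  # 2
```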

Remarks

Every collection of sets shatters the empty set. If C shatters a set of size n, then it shatters a set of size p for every p ≤ n (namely, any subset of the shattered set).

VC Classes

For a collection C and for m ∈ N, let Π_C[m] be the largest number of distinct subsets of a set of m elements that can be obtained as intersections of the set with members of C, that is,

Π_C[m] = max{|C_K| : |K| = m}.

We have Π_C[m] ≤ 2^m; however, if C shatters a set of size m, then Π_C[m] = 2^m.

Definition: A Vapnik-Chervonenkis class (or a VC class) is a collection C of sets such that VCD(C) is finite.

Example

Let S be the collection of sets {(−∞, t) : t ∈ R}. Any singleton is shattered by S. Indeed, if S = {x} is a singleton, then P({x}) = {∅, {x}}. Thus, if t > x we have (−∞, t) ∩ S = {x}, and if t ≤ x we have (−∞, t) ∩ S = ∅, so S_S = P(S).

No set S with |S| = 2 can be shattered by S. Indeed, suppose that S = {x, y}, where x < y. Then any member of S that contains y includes the entire set S, so S_S = {∅, {x}, {x, y}} ≠ P(S). This shows that S is a VC class and VCD(S) = 1.

Example

Consider the collection I = {[a, b] : a, b ∈ R, a ≤ b} of closed intervals. We claim that VCD(I) = 2.

There exists a set S = {x, y} with x < y such that I_S = P(S). Indeed, for any ε > 0, the intersections

[u, v] ∩ S = ∅ for v < x,
[x − ε, (x+y)/2] ∩ S = {x},
[(x+y)/2, y] ∩ S = {y},
[x − ε, y + ε] ∩ S = {x, y}

show that I_S = P(S).

No three-element set can be shattered by I. Let T = {x, y, z} with x < y < z. Any interval that contains x and z also contains y, so it is impossible to obtain the set {x, z} as the intersection of an interval of I with T.
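For a finite set of reals, any trace of I is already realized by an interval whose endpoints belong to the set itself (the ε-shifts above change nothing), so the trace consists of the empty set and the contiguous runs of the sorted points. A brute-force sketch under that observation (names are ours):

```python
def interval_trace(points):
    """Trace of the closed intervals on a finite set of reals: the empty
    set plus every contiguous run of the sorted points."""
    pts = sorted(points)
    subsets = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))  # trace of [pts[i], pts[j]]
    return subsets

def shattered_by_intervals(points):
    return len(interval_trace(points)) == 2 ** len(points)

assert shattered_by_intervals([0.0, 1.0])           # two points: shattered
assert not shattered_by_intervals([0.0, 1.0, 2.0])  # {0.0, 2.0} unrealizable
```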

Example: Three-point sets shattered by half-planes

Let H be the collection of closed half-planes in R², that is, the collection of sets of the form {x = (x_1, x_2) ∈ R² : a x_1 + b x_2 − c ≥ 0}, where a ≠ 0 or b ≠ 0. We claim that VCD(H) = 3.

Let P, Q, R ∈ R² be non-collinear. The collection H shatters the set {P, Q, R}: as the figure on the original slide indicates, each of the eight subsets ∅, {P}, {Q}, {R}, {P,Q}, {P,R}, {Q,R}, {P,Q,R} is cut off by a suitable half-plane. Hence VCD(H) ≥ 3.

No set that contains at least four points can be shattered by H. Consider four points P, Q, R, S. If three of them are collinear, the middle one belongs to every half-plane containing the other two, so the set cannot be shattered; hence we may assume that no three of the points are collinear.

If S is located inside the triangle P, Q, R, then every half-plane that contains P, Q, R also contains S, so it is impossible to separate the subset {P, Q, R}. Thus, we may assume that no point lies inside the triangle formed by the remaining three, that is, the four points are in convex position. Then any half-plane that contains two diagonally opposite points, for example P and R, also contains Q or S, which shows that it is impossible to separate the set {P, R}. (The original slide illustrates this with four points P, Q, R, S in convex position.)

Since no four-point set can be shattered by H, VCD(H) = 3.
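Shattering by half-planes can be explored empirically by sampling half-plane parameters. The sketch below (sampling ranges and names are ours) only exhibits realizable subsets, so it gives a lower bound on the trace: finding all 8 dichotomies confirms that a non-collinear triple is shattered, while the persistently missing diagonal pairs of a square illustrate, without proving, the four-point argument above:

```python
import random

def halfplane_trace(points, trials=50000, seed=1):
    """Subsets of `points` realized as {p : a*x + b*y - c >= 0} by randomly
    sampled closed half-planes; a lower bound on the true trace."""
    rng = random.Random(seed)
    realized = set()
    for _ in range(trials):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if a == 0 and b == 0:
            continue  # not a half-plane
        c = rng.uniform(-3, 3)
        realized.add(frozenset(
            p for p in points if a * p[0] + b * p[1] - c >= 0))
    return realized

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # non-collinear
print(len(halfplane_trace(triangle)))  # 8 = 2^3: every dichotomy is found

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(len(halfplane_trace(square)))    # 14 < 16: diagonal pairs never occur
```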

Example

Let R² be equipped with a system of coordinates and let R be the collection of rectangles whose sides are parallel to the x and y axes. Each such rectangle has the form [x_0, x_1] × [y_0, y_1].

There is a set S with |S| = 4 that is shattered by R. Indeed, let S = {P_n, P_s, P_e, P_w} be a set of four distinct points in R², where P_n is the unique northernmost point, P_s the unique southernmost, P_e the unique easternmost, and P_w the unique westernmost. If L ⊆ S and L ≠ ∅, let R_L be the smallest rectangle that contains L. Every point of S \ L is the unique extreme point of S in some direction, so it lies outside R_L; hence R_L ∩ S = L, and S is shattered. (The original slide shows the rectangle R_L for the set L = {P_n, P_s, P_e}.)

Example (cont'd)

The collection R cannot shatter any set of five or more points. Indeed, let S be a set of points with |S| ≥ 5 and, as before, let P_n be the northernmost point, and so on; if several points tie for an extreme position, select exactly one of them. The smallest rectangle containing the set K = {P_n, P_e, P_s, P_w} contains the entire set S, and hence so does every rectangle containing K; since |S| ≥ 5, some point of S \ K is captured along with K, so it is impossible to separate the set K. Thus VCD(R) = 4.
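Both rectangle claims can be brute-forced for concrete point sets: shrinking a rectangle to the smallest one containing the same trapped points shows that it suffices to draw the bounds from the points' own coordinates. A sketch under that observation (names are ours):

```python
def rect_trace(points):
    """Trace of the axis-parallel rectangles on a finite point set, using
    only rectangles whose bounds are coordinates of the points."""
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    subsets = {frozenset()}
    for x0 in xs:
        for x1 in xs:
            for y0 in ys:
                for y1 in ys:
                    subsets.add(frozenset(
                        p for p in points
                        if x0 <= p[0] <= x1 and y0 <= p[1] <= y1))
    return subsets

def shattered_by_rects(points):
    return len(rect_trace(points)) == 2 ** len(points)

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # unique N, S, E, W points
assert shattered_by_rects(diamond)
assert not shattered_by_rects(diamond + [(0, 0)])  # a fifth point breaks it
```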

Major Result

If a collection of sets C is not a VC class (that is, if the Vapnik-Chervonenkis dimension of C is infinite), then Π_C[m] = 2^m for all m ∈ N. However, we shall prove that if VCD(C) = d, then Π_C[m] is bounded by a polynomial of degree d in m.

The Function φ

For n, k ∈ N and 0 ≤ k ≤ n, define the number (n choose ≤ k) as

(n choose ≤ k) = Σ_{i=0}^{k} (n choose i).

Clearly, (n choose ≤ 0) = 1 and (n choose ≤ n) = 2^n.

Theorem

Let φ : N² → N be the function defined by

φ(d, m) = 1 if m = 0 or d = 0, and φ(d, m) = φ(d, m−1) + φ(d−1, m−1) otherwise.

We have φ(d, m) = (m choose ≤ d) for all d, m ∈ N.
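The recurrence and the closed form are easy to cross-check numerically; a minimal sketch (function names are ours):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def phi(d, m):
    """phi(d, m) computed via the recurrence in the theorem."""
    if m == 0 or d == 0:
        return 1
    return phi(d, m - 1) + phi(d - 1, m - 1)

def binom_le(m, d):
    """(m choose <= d): the sum of C(m, i) for i = 0, ..., d."""
    return sum(comb(m, i) for i in range(d + 1))

# Cross-check the closed form against the recurrence on a small grid.
assert all(phi(d, m) == binom_le(m, d)
           for d in range(8) for m in range(8))
```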

Proof

The argument is by strong induction on s = d + m. The base case, s = 0, means m = d = 0, and both sides equal 1; more generally, if m = 0 or d = 0, then φ(d, m) = 1 = (m choose ≤ d). Suppose then that d, m ≥ 1 and that the equality holds for all pairs (d′, m′) with d′ + m′ < d + m. We have:

φ(d, m) = φ(d, m−1) + φ(d−1, m−1)   (by definition)
= Σ_{i=0}^{d} (m−1 choose i) + Σ_{i=0}^{d−1} (m−1 choose i)   (by the inductive hypothesis)
= Σ_{i=0}^{d} (m−1 choose i) + Σ_{i=0}^{d} (m−1 choose i−1)   (since (m−1 choose −1) = 0)
= Σ_{i=0}^{d} ((m−1 choose i) + (m−1 choose i−1))
= Σ_{i=0}^{d} (m choose i)   (by Pascal's identity)
= (m choose ≤ d).

Sauer-Shelah Theorem

Theorem: If C is a VC class of subsets of S with VCD(C) = d, then Π_C[m] ≤ φ(d, m) for all m ∈ N, where φ is the function defined above.

Proof

The argument is by strong induction on s = d + m, the sum of the VC-dimension of C and the size of the set. For the base case, s = 0, we have d = m = 0; the collection C shatters only the empty set, so Π_C[0] = |C_∅| = 1 = φ(0, 0).

The inductive case: suppose that the statement holds for all pairs (d′, m′) with d′ + m′ < s, and let C be a collection of subsets of S such that VCD(C) = d.

Proof (cont'd)

Let K be a set of cardinality m and let k_0 be a fixed (but otherwise arbitrary) element of K. Consider the trace C_{K \ {k_0}}. Since |K \ {k_0}| = m − 1, we have, by the inductive hypothesis, |C_{K \ {k_0}}| ≤ φ(d, m−1).

Let C′ be the collection of sets given by

C′ = {G ∈ C_K : k_0 ∉ G and G ∪ {k_0} ∈ C_K}.

Since C′ consists only of subsets of K \ {k_0}, we have C′ = C′_{K \ {k_0}}. The VC-dimension of C′ is less than d. Indeed, let K′ be a subset of K \ {k_0} that is shattered by C′. Then K′ ∪ {k_0} is shattered by C (each trace L ⊆ K′ is witnessed by some G ∈ C′, and both G and G ∪ {k_0} belong to C_K), hence |K′| < d. By the inductive hypothesis, |C′| = |C′_{K \ {k_0}}| ≤ φ(d−1, m−1).

The collection of sets C_K is a collection of subsets of K that can be regarded as the union of two disjoint collections: the subsets in C_K that do not contain the element k_0 (each of these belongs to C_{K \ {k_0}}), and the subsets in C_K that contain k_0.

If L is a subset of the second type, then L \ {k_0} ∈ C_{K \ {k_0}}, and the map L ↦ L \ {k_0} is injective; its image meets the first collection exactly in the sets of C′. Thus

|C_K| = |C_{K \ {k_0}}| + |C′| ≤ φ(d, m−1) + φ(d−1, m−1) = φ(d, m),

which is the desired conclusion.
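To see the bound in action, take the closed intervals I, for which d = 2. On m collinear points the trace consists of the empty set and the m(m+1)/2 contiguous runs, so Π_I[m] = m(m+1)/2 + 1, which equals φ(2, m) = 1 + m + m(m−1)/2: for intervals the Sauer-Shelah bound is attained exactly. A quick numerical confirmation (a sketch; names are ours):

```python
from math import comb

def interval_growth(m):
    """Π_I[m] for closed intervals on m collinear points: the empty set
    plus the m(m+1)/2 contiguous runs of points."""
    return m * (m + 1) // 2 + 1

def phi(d, m):
    """phi(d, m), using the closed form proved earlier."""
    return sum(comb(m, i) for i in range(d + 1))

for m in range(1, 10):
    assert interval_growth(m) <= phi(2, m)  # the Sauer-Shelah bound
    assert interval_growth(m) == phi(2, m)  # and it is tight here
```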

Lemma 1

Lemma: For d ∈ N with d ≥ 2 we have 2^{d−1} ≤ d^d / d!.

The argument is by induction on d. The basis step, d = 2, is immediate: 2^1 = 2 = 2²/2!. Suppose the inequality holds for d. We have

(d+1)^{d+1} / (d+1)! = (d+1)^d / d! = (d^d / d!) · ((d+1)/d)^d = (d^d / d!) · (1 + 1/d)^d ≥ 2^{d−1} · 2 = 2^d   (by the inductive hypothesis),

because (1 + 1/d)^d ≥ 1 + d · (1/d) = 2.

Lemma 2

Lemma: We have φ(d, m) ≤ 2 m^d / d! for every m ≥ d and d ≥ 1.

The argument is by induction on d and m. If d = 1, then φ(1, m) = m + 1 ≤ 2m = 2 m^1/1! for m ≥ 1, so the inequality holds for every m ≥ 1 when d = 1. If m = d ≥ 2, then φ(d, m) = φ(d, d) = 2^d, and the desired inequality follows immediately from the previous lemma.

Proof (cont'd)

Suppose that the inequality holds for some m ≥ d ≥ 1. We have

φ(d, m+1) = φ(d, m) + φ(d−1, m)   (by the definition of φ)
≤ 2 m^d / d! + 2 m^{d−1} / (d−1)!   (by the inductive hypothesis)
= (2 m^{d−1} / (d−1)!) · (m/d + 1).

It is easy to see that the inequality

(2 m^{d−1} / (d−1)!) · (1 + m/d) ≤ 2 (m+1)^d / d!

is equivalent to

d/m + 1 ≤ (1 + 1/m)^d,

which is valid by Bernoulli's inequality. This yields φ(d, m+1) ≤ 2 (m+1)^d / d!.

Theorem

The function φ satisfies the inequality

φ(d, m) < (em/d)^d

for every m ≥ d and d ≥ 1.

Proof

By Lemma 2, φ(d, m) ≤ 2 m^d / d!. Since (em/d)^d = m^d · e^d / d^d, it suffices to show that

2 (d/e)^d < d!.

The argument is by induction on d ≥ 1. The basis case, d = 1, is immediate: 2/e < 1 = 1!.

Suppose that 2 (d/e)^d < d!. We have

2 ((d+1)/e)^{d+1} = 2 (d/e)^d · ((d+1)/d)^d · (d+1)/e = 2 (d/e)^d · (1 + 1/d)^d · (d+1)/e < 2 (d/e)^d · (d+1),

because (1 + 1/d)^d < e. The last inequality holds because the sequence ((1 + 1/d)^d)_{d ∈ N} is increasing with limit e. Since 2 ((d+1)/e)^{d+1} < 2 (d/e)^d · (d+1), the inductive hypothesis yields

2 ((d+1)/e)^{d+1} < d! · (d+1) = (d+1)!.
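Both estimates can be sanity-checked numerically; a short sketch (names are ours):

```python
from math import comb, e, factorial

def phi(d, m):
    """phi(d, m), using the closed form proved earlier."""
    return sum(comb(m, i) for i in range(d + 1))

for d in range(1, 8):
    for m in range(d, 20):
        assert phi(d, m) <= 2 * m**d / factorial(d)  # Lemma 2
        assert phi(d, m) < (e * m / d) ** d          # the theorem above
```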

Corollary

For every fixed d ≥ 1, φ(d, m) = O(m^d) as m → ∞.

References

[1] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989.

[2] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.

[3] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264-280, 1971.