CSE 648: Advanced Algorithms
Topic: VC-Dimension & ε-nets
April 03, 2003
Lecturer: Piyush Kumar
Scribed by: Gen Ohkawa

Please help us improve this draft. If you read these notes, please send feedback.

1 Introduction

The lack of a theory of perception has led computer scientists to try to solve more modest problems. Pattern classification is one such subfield of computer science; it deals with the assignment of a physical object or event to one or more categories [3]. For now, call this assignment (or function) f: it takes a physical object or event as input and assigns it to a specified category. To simplify the discussion, let the output be binary and the input be R^d. We take the domain of f to be R^d because pattern classification usually involves a data reduction step that extracts the properties or features of the input data and embeds the data in some space.

Statistical learning theory [4] shows that it is important to pick f from a class of functions that has small VC-dimension. The VC-dimension is also a central concept in the Probably Approximately Correct (PAC) model of learning ("probably": the probability of the bound failing is small; "approximately correct": when the bound holds, the classifier f has low error rate), one of the most general models used in computational learning theory. The idea of VC-dimension is central to very popular machine learning methods such as Support Vector Machines and Boosting [4]. It also helps us answer questions like: how many random samples does a learning procedure need to draw before it has sufficient information to learn an unknown target concept?

Here is an example that shows the power of random sampling and VC-dimension. Say we want to estimate the number of people who live within 10 miles of a hospital in a given country. We can do this by sampling and scaling up the results. What is amazing is that the same sample size works for both Japan and India! The magic lies in the concept of VC-dimension [2].

2 VC-Dimension

To understand how the VC-dimension works, we start with a few definitions. Here we use |S| to denote the cardinality of a set S.

Definition 1 A hypergraph is a pair H = (V, S) where V is a finite set whose elements are called vertices and S is a family of subsets of V, called edges.

Definition 2 A range space is a pair (X, R) where X is any finite or infinite set whose elements are called points and R is a family of subsets of X, called ranges.

Definition 3 A set A ⊆ X is said to be shattered by a range space (X, R) if all 2^|A| subsets of A can be obtained by intersecting A with members of R.

Definition 4 The Vapnik-Chervonenkis (VC) dimension of a range space (X, R) is the cardinality d of the largest set A that can be shattered.
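For a small finite range space, Definitions 3 and 4 can be checked by brute force. The sketch below is not from the original notes; Python, the ground set {0, ..., 4}, and the family of ranges (all intervals of consecutive integers) are illustrative assumptions.

```python
# Brute-force check of Definitions 3 and 4 on a small finite range space.
# Illustrative sketch only: the ground set X and the ranges (all intervals
# {i, ..., j} of consecutive integers) are assumptions made for demonstration.
from itertools import combinations

X = list(range(5))
ranges = [set(range(i, j + 1)) for i in range(5) for j in range(i, 5)]

def is_shattered(A, ranges):
    """Definition 3: every subset of A must equal A ∩ r for some range r."""
    A = set(A)
    realizable = {frozenset(A & r) for r in ranges}
    return len(realizable) == 2 ** len(A)

def vc_dimension(X, ranges):
    """Definition 4: the size of the largest subset of X that is shattered."""
    d = 0
    for k in range(1, len(X) + 1):
        if any(is_shattered(A, ranges) for A in combinations(X, k)):
            d = k
        else:
            break  # if no k-set is shattered, no larger set can be shattered either
    return d

print(vc_dimension(X, ranges))   # prints 2
```

It prints 2, matching the standard fact that intervals on a line shatter pairs of points but never a triple.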

Now let's take a look at an example of how we might calculate the VC-dimension.

Example 5 VC-dimension of closed half-spaces in R^2

Let H = (X, R) where X = R^2 and R = {h_{a,b,c} : (a, b) ≠ (0, 0)} with h_{a,b,c} = {(x, y) : ax + by + c ≥ 0}. (Simply put, X is the plane and R is the set of all closed half-planes of X.) What is the largest number of points that we can have in a set A so that we can shatter it?

What if we have 3 points in A? As we can see from Figure 1, all 8 subsets of the three points can be obtained by intersecting A with half-planes. Therefore this set of three points is shattered.

Figure 1: All subsets {}, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c} are obtained when |A| = 3.

What if we have 4 points? No: as we can see from Figure 2, we can take two points which are at a diagonal, and we would not be able to intersect just those two points without also intersecting a third point.

Figure 2: Unable to intersect only the diagonal (solid) points.

In this example of closed half-spaces, the VC-dimension is 3. If we change X = R^2 to X = R^d, the VC-dimension becomes d + 1.
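The 3-point and 4-point arguments above can also be checked mechanically. The sketch below is not part of the notes: it encodes "some closed half-plane ax + by ≥ c contains a chosen subset and excludes the rest" as a linear-programming feasibility problem (using scipy's linprog, an assumption of this sketch), and the two point sets are illustrative.

```python
# Brute-force companion to Example 5: half-planes shatter 3 points in general
# position, but cannot shatter 4 points in convex position (the diagonal pair
# cannot be isolated).  Sketch only.
from itertools import combinations

import numpy as np
from scipy.optimize import linprog


def cut_off_by_halfplane(points, subset):
    """True iff some closed half-plane a*x + b*y >= c contains `subset` and excludes the rest."""
    A_ub, b_ub = [], []
    for i, (x, y) in enumerate(points):
        if i in subset:                      # a*x + b*y - c >= 0
            A_ub.append([-x, -y, 1.0]); b_ub.append(0.0)
        else:                                # a*x + b*y - c <= -1  (strict exclusion, rescaled)
            A_ub.append([x, y, -1.0]); b_ub.append(-1.0)
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0                   # feasible <=> the subset is realizable


def shattered_by_halfplanes(points):
    n = len(points)
    return all(cut_off_by_halfplane(points, set(s))
               for k in range(n + 1) for s in combinations(range(n), k))


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]            # 3 points in general position
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # 4 points in convex position
print(shattered_by_halfplanes(triangle))   # True  -> half-planes shatter 3 points
print(shattered_by_halfplanes(square))     # False -> a diagonal pair is not realizable
```

On the square, the only subsets that fail are the two diagonal pairs, which is exactly the situation of Figure 2.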

Example 6 VC-dimension of all convex sets of the plane

A set S in n-dimensional space is called a convex set if the line segment joining any pair of points of S lies entirely in S. Let H = (X, R) where X = R^2 and R = {all convex sets of the plane}. We will choose our set A so that its points lie on the unit circle centered at the origin.

Claim: The VC-dimension is infinite in this example.

Argument (not so mathematical): Take any subset of the points of A on the circle. If we want to intersect 1 point, we can do so by drawing a point (Figure 3-1). If we want to intersect 2 points, we can draw a line segment (Figure 3-2). If we want to intersect 3 points, we can draw a triangle (Figure 3-3). What if we want to intersect n points? We can just join them in a convex polygon with n vertices (Figure 4)! Therefore the VC-dimension is infinite.

Figure 3: Convex sets intersecting 1, 2 and 3 points on the unit circle centered at the origin.

Figure 4: Convex polygon intersecting n points.
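The argument works because every point of A is an extreme point of the circle, so the convex hull of any subset T ⊆ A avoids all the other points. The sketch below is not from the notes; it verifies this for n = 6 equally spaced points (an arbitrary choice), reusing an LP feasibility test for membership in a convex hull.

```python
# Computational companion to Example 6: every subset of n points on the unit
# circle is cut out by its own convex hull, so the n points are shattered by
# convex sets.  Sketch only; n = 6 is an arbitrary choice.
from itertools import combinations
from math import cos, sin, pi

import numpy as np
from scipy.optimize import linprog


def in_convex_hull(q, pts):
    """True iff q is a convex combination of pts (an LP feasibility test)."""
    if not pts:
        return False
    P = np.array(pts, dtype=float).T                  # 2 x k matrix of points
    A_eq = np.vstack([P, np.ones(P.shape[1])])        # sum_i lam_i * p_i = q, sum_i lam_i = 1
    b_eq = np.append(np.array(q, dtype=float), 1.0)
    res = linprog(c=np.zeros(P.shape[1]), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * P.shape[1], method="highs")
    return res.status == 0

n = 6
circle = [(cos(2 * pi * i / n), sin(2 * pi * i / n)) for i in range(n)]

ok = all(
    not in_convex_hull(circle[j], [circle[i] for i in T])
    for T in (set(s) for k in range(n + 1) for s in combinations(range(n), k))
    for j in range(n) if j not in T
)
print(ok)   # True: each subset's convex hull excludes every other sample point
```

Since the same check succeeds for every n, arbitrarily large subsets of the circle are shattered, which is exactly the claim that the VC-dimension is infinite.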

Here are some sample exercises about VC-dimension. The solutions are placed at the end of the notes.

Exercise 1 - What is the VC-dimension of axis-aligned rectangles in the plane?

Exercise 2 - What about convex polygons in the plane with d sides?

2.1 ε-net

Definition 7 Let (X, R) be a range space with X finite, and let ε ∈ [0, 1]. A set N ⊆ X is called an ε-net for the range space (X, R) if N ∩ S ≠ ∅ for all S ∈ R with |S| ≥ ε|X|.

Comment: If μ is a probability measure on X, then N is an ε-net with respect to μ if N ∩ S ≠ ∅ for all S ∈ R with μ(S) ≥ ε (here μ(X) = 1).

2.2 Simpler Sampling Lemma

Lemma 8 Let W ⊆ X be any fixed subset with |W| ≥ ε|X|. A set of O((1/ε) ln(1/δ)) points sampled independently and uniformly at random from X intersects W with probability at least 1 − δ.

Proof: Any particular sample x is in W with probability at least ε, and not in W with probability at most 1 − ε. Therefore the probability that none of our samples lands in W is at most δ:

Pr[x ∈ W] ≥ ε,
Pr[x ∉ W] ≤ 1 − ε,
Pr[x ∉ W for all (1/ε) ln(1/δ) samples] ≤ (1 − ε)^((1/ε) ln(1/δ)) ≤ e^(−ln(1/δ)) = e^(ln δ) = δ.

Note that W may depend on X, but the sample size depends only on ε and δ!

Problem: This lemma only applies to a single given heavy set W, not to all heavy ranges of the range space simultaneously.

3 ε-net Theorem

The ε-net Theorem states the following:

Theorem 9 Let the VC-dimension of (X, R) be d ≥ 2 and let 0 < ε ≤ 1/2. Then there exists an ε-net for (X, R) of size at most O((d/ε) ln(1/ε)).
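To get a feel for Lemma 8, the simulation below (a sketch, not part of the notes; the ground-set size, ε, δ and trial count are arbitrary choices) fixes one heavy set W and repeatedly draws ⌈(1/ε) ln(1/δ)⌉ uniform samples; the observed miss rate concentrates around (1 − ε)^m ≤ δ. An ε-net is a stronger object, since it must hit every heavy range simultaneously, which is why Theorem 9 needs a larger sample than Lemma 8.

```python
# Monte Carlo illustration of Lemma 8 (sketch; parameters are arbitrary).
# One fixed set W with |W| = eps*|X| is hit by ceil((1/eps)*ln(1/delta))
# uniform random samples, except with probability at most delta.
import math
import random

n, eps, delta = 100_000, 0.01, 0.05
W = set(range(int(eps * n)))                      # one fixed heavy set, |W| = eps*n
m = math.ceil((1 / eps) * math.log(1 / delta))    # sample size from Lemma 8

trials = 2_000
misses = sum(
    all(random.randrange(n) not in W for _ in range(m))   # True if this trial misses W entirely
    for _ in range(trials)
)
print(f"sample size m = {m}")
print(f"empirical miss rate = {misses / trials:.3f}   (Lemma 8 bound: delta = {delta})")
```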

A better version of the ε-net theorem follows:

Theorem 10 If we choose O((d/ε) log(d/ε) + (1/ε) log(1/δ)) points at random from X, then the resulting set N is an ε-net with probability at least 1 − δ.

Interested readers should look up the proof in Alon and Spencer [1].

3.1 Constructing N in this case

Pick O((d/ε) ln(1/ε)) random samples and put them in N. The theorem says that N is an ε-net with probability at least 1/2!

Comment: See the references for proofs.

4 Applications

As an example of a possible application, let us design a data structure using this machinery. Let the set of points be P = {p_1, p_2, p_3, ..., p_n} where p_i ∈ R^d. We want to query the structure with any center c ∈ R^d and radius r and decide whether ||p_i − c|| ≥ r for all p_i ∈ P. If not, the algorithm should produce a witness p_i ∈ P such that ||p_i − c|| < r. Otherwise, with high probability, ||p_i − c|| ≥ r holds for at least a (1 − ε) fraction of the points.

We convert the problem into a VC-dimension problem by converting the balls of R^d into half-spaces in R^{d+1}. Consider the set system

(P, balls in R^d) = (Lifted(P), half-spaces in R^{d+1}).

We know the VC-dimension of half-spaces in R^{d+1} to be d + 2 from our first example! So, by our ε-net theorem, we pick a sample N of O((d/ε) log(d/ε) + (1/ε) log(1/δ)) points. If the query condition fails for some x ∈ N, we have a witness. Otherwise, let F = {x ∈ P : ||x − c|| < r}, i.e. the set where our condition fails to hold. Since N is, with high probability, an ε-net, we can conclude with high probability that |F| < ε|P|, i.e. the condition ||p_i − c|| ≥ r holds for all but an ε fraction of the points.

5 Solutions to Practice Exercises

Exercise 1 - What is the VC-dimension of axis-aligned rectangles in the plane?
Solution - 4

Exercise 2 - What about convex polygons in the plane with d sides?
Solution - 2d + 1

References

[1] N. Alon and J. Spencer, The Probabilistic Method, ACM Press, 1992.
[2] B. Chazelle, The Discrepancy Method: Randomness and Complexity, Cambridge University Press, 2000.
[3] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, New York, 1973.
[4] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., 1995.