VC Dimension Review

The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces.

Previously, in discussing PAC learning, we were trying to answer questions about how difficult it might be to learn a particular concept and how long it would take a learner to do so. That discussion had one big limitation: the hypothesis space H had to be finite. This time we discuss PAC learning for infinite hypothesis spaces. To do this we will need to introduce the concept of VC dimension, along with a handful of supporting definitions.

In what follows, we consider a binary classification problem over an instance space X. Each hypothesis h ∈ H splits X into two sets, {x ∈ X : h(x) = 1} and {x ∈ X : h(x) = 0}. In this situation we say that a dichotomy has been imposed on X by h. Note that either of these sets may be empty; that is still acceptable.

Definition: A set of instances S ⊆ X is shattered by H iff for every possible dichotomy of S there exists some hypothesis h ∈ H that is consistent with that dichotomy. In other words, S is shattered by H if there are enough hypotheses in H to agree with every possible labeling of S.

To get a better feel for this, let's look at some examples. We will start with the simplest case, where S is a single point x₁. Then S has only two possible labelings: x₁ = 0 and x₁ = 1. So we can easily come up with a hypothesis space containing one hypothesis for each labeling, i.e. a hypothesis space that shatters S. Take, for example, H = {h₁, h₂}, where h₁(x) = 1 for all x and h₂(x) = 0 for all x. The first hypothesis labels everything as 1 and the second labels everything as 0, so h₁ is consistent with the labeling x₁ = 1 and h₂ is consistent with the labeling x₁ = 0.
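To make the definition of shattering concrete, here is a minimal Python sketch (our own illustration, not part of the original notes; the helper name shatters is ours). It brute-forces the definition: enumerate every dichotomy of S and ask whether some hypothesis in H agrees with it, using the two constant hypotheses from the example above.

    from itertools import product

    def shatters(hypotheses, points):
        # H shatters S iff every dichotomy of S is realized by some h in H.
        for labeling in product([0, 1], repeat=len(points)):
            realized = any(all(h(x) == y for x, y in zip(points, labeling))
                           for h in hypotheses)
            if not realized:
                return False
        return True

    h1 = lambda x: 1          # labels everything as 1
    h2 = lambda x: 0          # labels everything as 0
    H = [h1, h2]

    print(shatters(H, [0.5]))        # True: one point, two dichotomies, both realized
    print(shatters(H, [0.5, 2.0]))   # False: no h in H gives the two points different labels

The same brute-force check works for any finite sample and any finite set of hypotheses, which is how the later examples can be verified as well.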

Here is a figure showing a set of three instances shattered by eight hypotheses. The hypotheses are indicated by the ellipses, and classification is determined by being inside or outside an ellipse. For every possible labeling of the three points with 0s and 1s, there is an ellipse that agrees with that labeling.

There are also several good examples from the lectures. Michael asks us to find the largest number of points in the plane that can be shattered by the hypothesis space of lines, where each line divides the plane into a 0 region and a 1 region. It turns out that we can find an arrangement of 3 points that can be shattered by lines, but no arrangement of 4 points that can be.
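As a sketch of the three-point claim, the following snippet (our own, not from the lecture; the helper name find_separating_line is ours) uses the perceptron algorithm with a bias term to exhibit a separating line for each of the 2³ = 8 labelings of a triangle. The perceptron is guaranteed to converge whenever the labeling is linearly separable, so the epoch cap is only a safeguard.

    import numpy as np
    from itertools import product

    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # a triangle

    def find_separating_line(X, labels, max_epochs=1000):
        # Perceptron with bias; returns (w, b) if it separates the data, else None.
        y = np.where(np.array(labels) == 1, 1, -1)
        w, b = np.zeros(2), 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi + b) <= 0:       # misclassified (or exactly on the line)
                    w, b = w + yi * xi, b + yi
                    mistakes += 1
            if mistakes == 0:
                return w, b
        return None

    for labeling in product([0, 1], repeat=3):
        line = find_separating_line(points, labeling)
        print(labeling, "realized by a line" if line is not None else "no line found")

For four points, roughly speaking, either one point lies in the convex hull of the other three, or the points form a quadrilateral whose "opposite corners" labeling is not linearly separable, which is why no arrangement of 4 points works.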

The image above shows a three-point set shattered by lines in the plane. This leads naturally to the following question: for a given instance space X and hypothesis space H, what is the largest subset of X that can be shattered by H? The size of this subset has special significance, and is termed the VC dimension of H, which is denoted VC(H).

Definition: Given an instance space X, the Vapnik-Chervonenkis dimension of H over X is the size of the largest finite subset of X that can be shattered by H. If arbitrarily large subsets of X can be shattered, then VC(H) = ∞.

Despite what is written in the lectures, it is best not to think of VC dimension as a measure of the power of a hypothesis space. Instead, it is better to think of VC dimension as a measure of the complexity of a hypothesis space. This distinction is important for heuristic reasons later on.

It is important to remember that VC(H) is the size of the largest subset of X that can be shattered by H. In the example above with lines in the plane, even though VC(H) = 3, we can still cook up a set of 3 points in the plane that cannot be shattered by H. For example, the three collinear points in the plane shown below cannot be shattered by lines.
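The collinear case can be certified exactly with a small linear program: labeled points are strictly separable by a line exactly when some (w, b) satisfies y_i(w·x_i + b) ≥ 1 for all i. The sketch below (our own, assuming SciPy is available; the helper name linearly_separable is ours) shows that labeling the middle of three collinear points differently from the outer two is infeasible.

    import numpy as np
    from scipy.optimize import linprog

    def linearly_separable(X, y):
        # Feasibility LP: is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n, d = X.shape
        A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # variables z = (w, b)
        b_ub = -np.ones(n)
        res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (d + 1))
        return res.success

    X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]    # three collinear points
    print(linearly_separable(X, [+1, -1, +1]))  # False: the middle point cannot differ
    print(linearly_separable(X, [+1, +1, -1]))  # True: a line between the last two works

The LP view is convenient because infeasibility is a genuine certificate that no line exists, whereas a failed search (such as a capped perceptron run) only suggests it.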

In general, it is easy to find a lower bound m for VC(H), since we only need to exhibit one set of m points that can be shattered by H. It is harder to establish an upper bound n for VC(H), because we must prove that no set of n points can be shattered by H.

Example: Suppose the hypothesis space H is the set of single intervals on the real line R, and the instance space X is the set of points on the line. We can shatter sets of two points with H, so VC(H) ≥ 2, but it can be shown that no set of three points can be shattered by H, so VC(H) = 2 (a short computational check appears below).

Example: Now suppose the hypothesis space H is the set of all convex polygons in the plane, with the instance space consisting of points in the plane. For any number n, choose n points in convex position (say, as the vertices of a convex polygon). We can deform the polygon to leave out any chosen subset of the vertices while keeping the rest inside, so every labeling of the n points is realized, giving an n-point set shattered by H. Since n can be arbitrarily large, we see that VC(H) = ∞.

We also note that if the hypothesis space is finite, there is a relationship between the size of the hypothesis space H and VC(H). Suppose VC(H) = d. For a shattered set of d points from X there are 2^d possible labelings, and each of these labelings requires a separate hypothesis. Thus 2^d ≤ |H|, and it follows that VC(H) = d ≤ log₂|H|.
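Here is the computational check of the interval example promised above (our own sketch; the helper names are hypothetical). It uses the observation that an interval labels exactly a contiguous run of the sorted sample points as 1, so a dichotomy is realizable iff its positive points are consecutive in sorted order.

    from itertools import product

    def interval_realizable(points, labeling):
        # The positive points must form a contiguous run of the sorted sample.
        order = sorted(range(len(points)), key=lambda i: points[i])
        ones = [rank for rank, i in enumerate(order) if labeling[i] == 1]
        return ones == list(range(ones[0], ones[0] + len(ones))) if ones else True

    def shattered_by_intervals(points):
        return all(interval_realizable(points, lab)
                   for lab in product([0, 1], repeat=len(points)))

    print(shattered_by_intervals([1.0, 4.0]))        # True:  VC(H) >= 2
    print(shattered_by_intervals([1.0, 4.0, 9.0]))   # False: the labeling (1, 0, 1) is missed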

Now, returning to PAC learning and ε-exhaustion, we note that it is possible (but apparently very difficult) to derive a bound on the number of training examples needed to ε-exhaust the version space of H with probability at least (1 − δ). This bound is given by:

    m ≥ (1/ε) · (8·VC(H)·log₂(13/ε) + 4·log₂(2/δ)),

which is analogous to the bound we saw in the case where H was finite (recall, that bound was m ≥ (1/ε) · (ln|H| + ln(1/δ))). In the expression above, we see that VC(H) must be finite for us to bound the number of training examples needed to ε-exhaust the version space of H. This is consistent with the theorem Michael provides in the lecture, which states:

Theorem: A concept class C is PAC learnable if and only if VC(C) < ∞.

Finally, we recall the heuristic that VC(C) measures the complexity of the concept class (in this case, C). This heuristic is also consistent with the theorem above: if the concept class is too complex, i.e. VC(C) = ∞, then roughly speaking we will have trouble choosing, with probability (1 − δ), a hypothesis that has low error, and so the concept class C will not be PAC learnable.
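As a quick sanity check on the two sample-size bounds, the following sketch (our own; ε = 0.1, δ = 0.05, and the example hypothesis classes are just illustrative values) plugs numbers into the VC-based bound and the earlier finite-|H| bound.

    from math import ceil, log, log2

    def vc_sample_bound(vc_dim, eps, delta):
        # m >= (1/eps) * (8 * VC(H) * log2(13/eps) + 4 * log2(2/delta))
        return ceil((8 * vc_dim * log2(13 / eps) + 4 * log2(2 / delta)) / eps)

    def finite_h_sample_bound(h_size, eps, delta):
        # m >= (1/eps) * (ln|H| + ln(1/delta))
        return ceil((log(h_size) + log(1 / delta)) / eps)

    eps, delta = 0.1, 0.05
    print(vc_sample_bound(3, eps, delta))              # e.g. lines in the plane, VC(H) = 3
    print(finite_h_sample_bound(2 ** 16, eps, delta))  # a finite class with |H| = 65536

Note how the VC-based bound grows linearly in VC(H), just as the finite bound grows with ln|H|, matching the relation VC(H) ≤ log₂|H| noted earlier.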