Computational Learning Theory: Shattering and VC Dimensions. Machine Learning, Spring 2018. The slides are mainly from Vivek Srikumar.

This lecture: Computational Learning Theory. The theory of generalization; Probably Approximately Correct (PAC) learning; positive and negative learnability results; agnostic learning; shattering and the VC dimension.

Infinite Hypothesis Space. The previous analysis was restricted to finite hypothesis spaces. Some infinite hypothesis spaces are more expressive than others: e.g., rectangles or 17-sided convex polygons vs. general convex polygons; a linear threshold function vs. a combination of LTUs. We need a measure of the expressiveness of an infinite hypothesis space other than its size. The Vapnik-Chervonenkis dimension (VC dimension) provides such a measure: what is the expressive capacity of a set of functions? Analogous to the bounds in terms of |H|, there are bounds for sample complexity using VC(H).

Learning Rectangles. Assume the target function is an axis-parallel rectangle in the plane: points inside the rectangle are positive, and points outside are negative. (A sequence of figures illustrates the setting with example points and candidate rectangles.) Will we be able to learn the target rectangle?
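To make the rectangle example concrete, here is a minimal sketch (not part of the original slides) of the classic consistent learner for this class: predict with the tightest axis-parallel rectangle that encloses all positive training points. It assumes Python with NumPy; the function names and the toy data are illustrative.

```python
# A minimal sketch (not from the slides): the tightest-fit learner for
# axis-parallel rectangles. The function names and toy data are illustrative.
import numpy as np

def fit_tightest_rectangle(X, y):
    """Smallest axis-parallel box containing all positive training points."""
    pos = np.asarray(X, dtype=float)[np.asarray(y, dtype=bool)]
    return pos.min(axis=0), pos.max(axis=0)

def predict_rectangle(rect, X):
    """Label a point positive iff it lies inside the learned box."""
    lows, highs = rect
    X = np.asarray(X, dtype=float)
    return np.all((X >= lows) & (X <= highs), axis=1)

X_train = [(1, 1), (2, 3), (3, 2), (0, 4), (5, 5)]
y_train = [True, True, True, False, False]
rect = fit_tightest_rectangle(X_train, y_train)
print(predict_rectangle(rect, [(2, 2), (4, 4)]))   # [ True False]
```

Any hypothesis produced this way is consistent with the positive training examples, and if the data really comes from an axis-parallel rectangle it never labels a point positive that the target labels negative; its only errors are false negatives.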

Let's think about the expressivity of functions. There are four ways to label two points, and it is possible to draw a line that separates positive and negative points in all four cases. Linear functions are expressive enough to shatter two points. What about fourteen points?

Shattering. (The next few slides show fourteen points under a particular labeling.) This particular labeling of the points cannot be separated by any line. Linear functions are not expressive enough to shatter fourteen points, because there is a labeling that cannot be separated by them. Of course, a more complex function could separate them.
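The claim that a particular labeling cannot be separated by any line can be checked mechanically. Below is a hedged sketch (not from the slides, and it assumes SciPy is available): linear separability of a finite labeled point set is posed as a linear-programming feasibility problem, and the familiar XOR-style labeling of four corner points comes out non-separable. The helper name and the points are illustrative.

```python
# A sketch (not from the slides): test whether a labeling of points can be
# realized by a linear separator, by solving an LP feasibility problem
# (find w, b with y_i * (w . x_i + b) >= 1). Requires SciPy; names are ours.
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)            # labels in {+1, -1}
    n, d = X.shape
    # Variables z = (w_1..w_d, b). The constraint y_i*(w.x_i + b) >= 1
    # becomes -y_i * [x_i, 1] @ z <= -1, i.e. A_ub @ z <= b_ub.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

corners = [(0, 0), (1, 1), (0, 1), (1, 0)]
print(linearly_separable(corners, [+1, +1, -1, -1]))   # False: XOR labeling
print(linearly_separable(corners, [+1, -1, -1, -1]))   # True: one corner vs. rest
```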

Shattering. Definition: A set S of examples is shattered by a set of functions H if, for every partition of the examples in S into positive and negative examples, there is a function in H that gives exactly these labels to the examples. Intuition: A rich set of functions shatters large sets of points.

Example 1: The hypothesis class of left-bounded intervals on the real axis, [0, a) for some real number a > 0. A single point (say, some x > 0) can be shattered by placing a on either side of it.

Sets of two points cannot be shattered. That is: given two points, you can label them in such a way that no concept in this class will be consistent with their labeling (label the smaller point negative and the larger point positive).
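For the left-bounded intervals of Example 1, this impossibility argument can also be checked by brute force. This is a small illustrative sketch of ours, not from the slides: since only the gap in which the threshold a falls matters, trying one candidate a per gap is exhaustive.

```python
# Our own brute-force check (not from the slides) for the class [0, a), a > 0.
# Only the gap in which the threshold a falls matters, so trying one candidate
# per gap is exhaustive for a finite set of (positive) points.
from itertools import product

def realizable_by_left_interval(points, labels):
    """Can some [0, a) label exactly the True points positive?"""
    pts = sorted(points)
    candidates = [(lo + hi) / 2 for lo, hi in zip([0.0] + pts, pts + [pts[-1] + 2])]
    return any(all((x < a) == lab for x, lab in zip(points, labels))
               for a in candidates)

two_points = [1.0, 2.0]
for labels in product([True, False], repeat=2):
    print(labels, realizable_by_left_interval(two_points, labels))
# Only (False, True) -- smaller point negative, larger point positive -- fails.
```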

Example 2: The hypothesis class of intervals on the real axis, [a, b] for some real numbers b > a. Points inside the interval are positive; points outside are negative.

All sets of one or two points can be shattered, but sets of three points cannot. Proof? Enumerate the possible labelings: for any three points x1 < x2 < x3, no interval can contain x1 and x3 while excluding x2.
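The same brute-force idea works for Example 2. The sketch below (ours, not from the slides) enumerates candidate interval endpoints in the gaps between points and confirms that two points are shattered while three are not.

```python
# Our own brute-force check (not from the slides) for the interval class [a, b].
# Candidate endpoints only need to be tried in the gaps between points (and on
# either side of them), so this search is exhaustive for a finite point set.
from itertools import product

def realizable_by_interval(points, labels):
    """Can some [a, b] with a < b label exactly the True points positive?"""
    pts = sorted(points)
    gaps = ([pts[0] - 2, pts[0] - 1]
            + [(p + q) / 2 for p, q in zip(pts, pts[1:])]
            + [pts[-1] + 1, pts[-1] + 2])
    return any(all((a <= x <= b) == lab for x, lab in zip(points, labels))
               for a in gaps for b in gaps if a < b)

for n, pts in [(2, [1.0, 2.0]), (3, [1.0, 2.0, 3.0])]:
    shattered = all(realizable_by_interval(pts, labs)
                    for labs in product([True, False], repeat=n))
    print(n, "points shattered?", shattered)   # 2 -> True, 3 -> False
```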

Example 3: Half-spaces in the plane. Can one point be shattered? Two points? Three points? Can any three points be shattered?

(Figures: half-spaces in the plane applied to sets of 3 points and of 4 points. Credit: Eric Xing @ CMU, 2006-2015.)

Shattering: the adversarial game. Players: you and an adversary. You: "Hypothesis class H can shatter these d points." Adversary: "That's what you think! Here is a labeling that will defeat you." You: "Aha! There is a function h ∈ H that correctly predicts your evil labeling." Adversary: "Argh! You win this round. But I'll be back..."

Some functions can shatter infinitely many points! This is the case when arbitrarily large finite subsets of the instance space X can be shattered by the hypothesis space H. An unbiased hypothesis space H shatters the entire instance space X, i.e., it can induce every possible partition on the set of all possible instances. The larger the subset of X that can be shattered, the more expressive the hypothesis space H is, i.e., the less biased it is.

Vapnik-Chervonenkis Dimension. Recall: a set S of examples is shattered by a set of functions H if, for every partition of the examples in S into positive and negative examples, there is a function in H that gives exactly these labels to the examples. Definition: The VC dimension of hypothesis space H over instance space X is the size of the largest finite subset of X that is shattered by H. If there exists any subset of size d that can be shattered, then VC(H) ≥ d; even one such subset will do. If no subset of size d can be shattered, then VC(H) < d.
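The asymmetry in this definition (one shattered subset of size d suffices for VC(H) ≥ d) is easy to turn into a search. The sketch below is our own illustration, with 1-D threshold functions standing in for H; note that a search over a finite pool of points can only certify lower bounds on the VC dimension.

```python
# Our own illustration of the definition: if *some* subset of size d is
# shattered, VC(H) >= d. `can_realize` stands in for the hypothesis class; we
# plug in 1-D threshold functions ("positive iff x < a"), whose VC dimension
# is 1. A search over a finite pool can only certify lower bounds.
from itertools import combinations, product

def shatters(can_realize, points):
    """Every labeling of `points` is realizable by the class."""
    return all(can_realize(points, labels)
               for labels in product([True, False], repeat=len(points)))

def has_shattered_subset(can_realize, pool, d):
    """Some size-d subset of `pool` is shattered, hence VC >= d."""
    return any(shatters(can_realize, s) for s in combinations(pool, d))

def threshold_realizable(points, labels):
    """Positive iff x < a, for some threshold a."""
    pts = sorted(points)
    candidates = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    return any(all((x < a) == lab for x, lab in zip(points, labels))
               for a in candidates)

pool = [0.0, 1.0, 2.0, 3.0]
print(has_shattered_subset(threshold_realizable, pool, 1))   # True  -> VC >= 1
print(has_shattered_subset(threshold_realizable, pool, 2))   # False for every pair here
```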

What we have managed to prove:
Half intervals: VC dimension 1. (There is a dataset of size 1 that can be shattered; no dataset of size 2 can be shattered.)
Intervals: VC dimension 2. (There is a dataset of size 2 that can be shattered; no dataset of size 3 can be shattered.)
Half-spaces in the plane: VC dimension 3. (There is a dataset of size 3 that can be shattered; no dataset of size 4 can be shattered.)

More VC dimensions:
Linear threshold unit in d dimensions: VC dimension d + 1.
Neural networks: VC dimension on the order of the number of parameters.
1-nearest neighbor: infinite VC dimension.
Intuition: A rich set of functions shatters large sets of points.

Why VC dimension? Remember the sample complexity results: for consistent learners, m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice; for agnostic learning, m ≥ (1/(2ε²))(ln|H| + ln(1/δ)). Sample complexity in both cases depends on the log of the size of the hypothesis space. For infinite hypothesis spaces, the VC dimension plays the role of log(|H|).

VC dimension and Occam's razor. Using VC(H) as a measure of expressiveness, we get an Occam theorem for infinite hypothesis spaces. Given a sample D with m examples, suppose we find some h ∈ H that is consistent with all m examples. If m ≥ (1/ε)(8·VC(H)·log₂(13/ε) + 4·log₂(2/δ)), then with probability at least 1 − δ, h has error less than ε.
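For a feel of the numbers, here is a quick evaluation of this bound (our own illustration) for half-spaces in the plane, where VC(H) = 3.

```python
# Our own numerical illustration of the bound above, for half-spaces in the
# plane (VC(H) = 3).
import math

def occam_vc_sample_size(vc, eps, delta):
    """Smallest m with m >= (1/eps) * (8*vc*log2(13/eps) + 4*log2(2/delta))."""
    return math.ceil((8 * vc * math.log2(13 / eps) + 4 * math.log2(2 / delta)) / eps)

print(occam_vc_sample_size(vc=3, eps=0.1, delta=0.05))   # about 1900 examples
```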

VC dimension and agnostic learning. A similar statement can be made for the agnostic setting as well. If we have m examples, then with probability at least 1 − δ, the true error of a hypothesis h with training error err_S(h) is bounded by
err_D(h) ≤ err_S(h) + sqrt( (VC(H)·(ln(2m/VC(H)) + 1) + ln(4/δ)) / m ).
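And a quick look (again our own illustration) at how the VC term in this bound shrinks as the number of examples grows.

```python
# Our own illustration: the VC term added to the training error in the bound
# above, and how it shrinks as the number of examples m grows.
import math

def vc_generalization_gap(vc, m, delta):
    return math.sqrt((vc * (math.log(2 * m / vc) + 1) + math.log(4 / delta)) / m)

for m in [100, 1000, 10000]:
    print(m, round(vc_generalization_gap(vc=3, m=m, delta=0.05), 3))
# Roughly 0.45, 0.16, 0.06: the gap shrinks on the order of sqrt(VC(H)/m).
```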

PAC learning: What you need to know. What is PAC learning? Remember: we care about generalization error, not training error. Finite hypothesis spaces: the connection between the size of the hypothesis space and sample complexity; derive and understand the sample complexity bounds; count the number of hypotheses in a hypothesis class. Infinite hypothesis classes: what are shattering and the VC dimension? How do we find the VC dimension of simple concept classes? Higher VC dimension implies higher sample complexity.

Computational Learning Theory: the theory of generalization; Probably Approximately Correct (PAC) learning; positive and negative learnability results; agnostic learning; shattering and the VC dimension.

Why computational learning theory? It raises interesting theoretical questions. If a concept class is weakly learnable (i.e., there is a learning algorithm that can produce a classifier that does slightly better than chance), does this mean that the concept class is strongly learnable? This leads to Boosting. We have seen bounds of the form: true error < training error + (a term that depends on δ and the VC dimension). Can we use this to define a learning algorithm? This leads to the Structural Risk Minimization principle and the Support Vector Machine.