VC dimension and Model Selection


Overview
- PAC model: review
- VC dimension: definition, examples
- Sample complexity: lower bound, upper bound
- Model selection
Introduction to Machine Learning

PAC model: setting
- A distribution D over X (unknown)
- Target function c_t ∈ C, where c_t : X → {0,1}
- Hypothesis h ∈ H, where h : X → {0,1}
- Error probability: error(h) = Pr_D[h(x) ≠ c_t(x)]
- Oracle: EX(c_t, D)
Intro. to Machine Learning 2016

PAC model: definition
C and H are concept classes over X. C is PAC learnable by H if there exists an algorithm A such that for every distribution D over X, every c_t ∈ C, and every input ε and δ, the algorithm, given access to EX(c_t, D), outputs a hypothesis h ∈ H such that with probability at least 1-δ we have error(h) < ε.
Complexities of interest: sample size and running time.

PAC model: last week
For a finite hypothesis class H, sample size m:
- Realizable case: m > (1/ε) ln(|H|/δ)
- Non-realizable case: m > (2/ε²) ln(2|H|/δ)
Impossibility results:
- m > (1/2) log|H|
- m > (1/(4ε)) ln(1/δ)
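These finite-class bounds are simple to evaluate numerically. A quick sketch (the function names and the example numbers are mine, not from the slides):

```python
import math

def sample_size_realizable(h_size, eps, delta):
    """m > (1/eps) * ln(|H|/delta), the realizable finite-class bound."""
    return math.ceil((1.0 / eps) * math.log(h_size / delta))

def sample_size_agnostic(h_size, eps, delta):
    """m > (2/eps^2) * ln(2|H|/delta), the non-realizable finite-class bound."""
    return math.ceil((2.0 / eps**2) * math.log(2 * h_size / delta))

# Example: |H| = 10^6, eps = 0.1, delta = 0.05.
m_real = sample_size_realizable(10**6, eps=0.1, delta=0.05)
m_agn = sample_size_agnostic(10**6, eps=0.1, delta=0.05)
# The agnostic bound is larger by roughly a 1/eps factor.
```

Note how both grow only logarithmically in |H|, which is why the bounds stay usable even for very large finite classes.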

VC dimension: motivation
- Infinite hypothesis classes: thresholds, rectangles
- TODAY: the general VC dimension
- Applies to both the realizable and the non-realizable case

VC dimension: definition
Notation: C is a concept class, S is a sample.
- Projection: Π_C(S) = { c|_S : c ∈ C }
- Shattering: C shatters S if |Π_C(S)| = 2^|S|
- VC dimension: the size of the largest shattered set, max{ d : ∃S, |S| = d, |Π_C(S)| = 2^|S| }
- If no maximum exists the VC dimension is infinite: for every d there is a shattered set of size d

VC dimension: thresholds
c_θ(x) = I(x ≥ θ)
- VC-dim ≥ 1: for S = {0.5}, c_0.3(0.5) = 1 and c_0.6(0.5) = 0
- VC-dim < 2: for S = {z_1, z_2} with z_1 < z_2, the labeling c(z_1) = 1, c(z_2) = 0 cannot be realized (if z_1 ≥ θ then z_2 ≥ θ as well)
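Shattering can be checked by brute force on small sets. A sketch (the helper name `shatters` and the grid of candidate θ values are mine), verifying VC-dim = 1 for thresholds:

```python
def shatters(hypotheses, points):
    """True if the hypothesis set realizes all 2^|S| labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Thresholds c_theta(x) = I(x >= theta), sampled on a grid of theta values.
thresholds = [lambda x, t=t: int(x >= t) for t in [i / 100 for i in range(101)]]

assert shatters(thresholds, [0.5])           # VC-dim >= 1
assert not shatters(thresholds, [0.3, 0.6])  # labeling (1, 0) is impossible
```

Only three of the four labelings of {0.3, 0.6} are realized: (1,1), (0,1), (0,0).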

VC dimension: union of intervals
- Intervals on [0,1]: a finite but unbounded number of intervals
- Any d points can be shattered (use one interval per positive point), so VC-dim = ∞

VC dimension: convex polygons
- Convex polygons in the plane
- Any d points (placed on a circle) can be shattered, taking the polygon whose vertices are the positive points, so VC-dim = ∞

VC dimension: hyperplanes
c_{w,θ}(x) = sign(Σ_i w_i x_i + θ)
VC dimension ≥ d+1:
- S = {0, e_1, …, e_d}
- Given a labeling L ∈ {-1,+1}^{d+1}, define c_{w,θ} by w_i = L(e_i) and θ = L(0)/2
- Then c_{w,θ}(0) = sign(θ) = L(0), and c_{w,θ}(e_i) = sign(w_i + θ) = L(e_i)

VC dimension: hyperplanes
VC dimension < d+2, by contradiction:
- Assume there is a shattered set S with |S| = d+2
- Radon's theorem: any S ⊆ R^d with |S| ≥ d+2 can be partitioned into S' and S \ S' such that conv(S') ∩ conv(S \ S') ≠ ∅
- Label S' positive and S \ S' negative, and let c_{w,θ} be a separating hyperplane realizing this labeling
- Let POS and NEG be the positive and negative halfspaces of c_{w,θ}

VC dimension: hyperplanes
- conv(S') ⊆ POS and conv(S \ S') ⊆ NEG, since halfspaces are closed under convex combinations
- Radon's theorem: conv(S') ∩ conv(S \ S') ≠ ∅
- However POS ∩ NEG = ∅: contradiction!
- So there is no such shattered set S, and VC-dim < d+2. QED
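The lower-bound construction (w_i = L(e_i), θ = L(0)/2) can be verified numerically. A small sketch (all names are mine), checking every labeling of {0, e_1, …, e_d} for d = 3:

```python
import itertools

d = 3
# S = {0, e_1, ..., e_d} in R^d
S = [[0.0] * d] + [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]

def halfspace(w, theta, x):
    """c_{w,theta}(x) = sign(<w, x> + theta), with sign(0) taken as +1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + theta >= 0 else -1

# Every labeling L is realized by w_i = L(e_i), theta = L(0)/2:
# at the origin the value is theta = ±1/2, at e_i it is w_i + theta = ±1 ± 1/2.
for L in itertools.product([-1, 1], repeat=d + 1):
    w = list(L[1:])       # L[0] labels the origin, L[i] labels e_i
    theta = L[0] / 2.0
    assert all(halfspace(w, theta, x) == label for x, label in zip(S, L))
# All 2^(d+1) labelings are realized: VC-dim of halfspaces in R^d is >= d+1.
```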

VC dim: sample lower bound
Theorem: if VC-dim(C) = d+1, then m ≥ d/(16ε) examples are required.
Proof:
- Take points {z_0, z_1, …, z_d} and the distribution D(x) = 1-8ε for x = z_0, D(x) = 8ε/d for x = z_i (i ≥ 1), and 0 otherwise
- Target function: c_t(z_0) = 1; each c_t(z_i) is 0 or 1 independently with probability 1/2
- RARE = {z_1, …, z_d}
- Assume |S ∩ RARE| ≤ d/2; then |UNSEEN| ≥ d/2 and Pr[error] ≥ (1/2) · (8ε/d) · |UNSEEN| ≥ 2ε

VC dim: sample lower bound
- E[|S ∩ RARE|] = 8εm ≤ d/2 for m ≤ d/(16ε)
- So (Markov-type argument) Pr[|S ∩ RARE| ≤ d/2] ≥ 1/2
- With probability at least 1/2 the error is at least 2ε. QED

VC dim: sample upper bound
An incorrect proof first:
- For a sample S, C_S = Π_C(S) is finite
- Use the finite-class bound: m ≥ (1/ε) log(|Π_C(S)|/δ)
- Problem: S itself defines C_S = Π_C(S), so the class depends on the sample
Solution:
- Take 2m points, S = S_1 ∪ S_2
- The randomization is in the split into S_1 and S_2
- Benefit: we can work with Π_C(S)

VC dim: sample upper bound
- Bad concepts: Bad = { h : error(h) > ε }
- Hitting set S: for every h ∈ Bad there exists x ∈ S with c_t(x) ≠ h(x)
- Goal: compute the probability of S being a hitting set
- Event A: S_1 is not a hitting set, i.e., some h ∈ Bad is consistent with S_1. Pr[A] ≤ ???
- Event B: some h ∈ Bad is consistent with S_1 and has ≥ εm errors on S_2

VC dim: sample upper bound
- Pr[B] = Pr[B|A] Pr[A], since B implies A
- Bounding Pr[B|A]: fix such an h; its expected number of errors on S_2 is εm, and with probability at least 1/2 the count is at least that large
- Result: Pr[A] ≤ 2 Pr[B]
- Let F = Π_C(S_1 ∪ S_2)
- Fix h ∈ F that is consistent with S_1 and has εm errors on S_2; let l be its number of errors
- Compute the probability over the random partition into S_1 and S_2

VC dim: sample upper bound
- Number of total partitions (placements of the l error points): C(2m, l)
- Number of partitions that keep h consistent on S_1 (all l errors in S_2): C(m, l)
- Probability bound: C(m,l) / C(2m,l) = ∏_{i=0}^{l-1} (m-i)/(2m-i) ≤ 2^{-l}
Bounding the probabilities:
- Union bound over h ∈ F: Pr[B] ≤ |F| · 2^{-εm}
- Pr[A] ≤ 2 Pr[B] ≤ 2|F| · 2^{-εm}

VC dim: sample upper bound
High confidence: require δ ≥ 2|F| · 2^{-εm}, i.e., m ≥ (1/ε) log(2|F|/δ)
We need to bound |F| = |Π_C(S_1 ∪ S_2)|.
Sauer-Shelah lemma: if VC-dim(C) = d and |S| = 2m, then
  |Π_C(S)| ≤ Σ_{i=0}^{d} C(2m, i)
Bound: ≤ 2^{2m} for m ≤ d, and ≤ 2(2m)^d for m > d
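The Sauer-Shelah bound is a one-line computation. A sketch (the helper name is mine):

```python
from math import comb

def sauer_bound(d, n):
    """Max number of distinct labelings of n points by a class of VC-dim d:
    sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

# For n <= d the bound equals the trivial 2^n; beyond that it grows
# only polynomially in n, which is what makes the double-sample bound work.
assert sauer_bound(3, 3) == 2 ** 3
assert sauer_bound(3, 100) < 2 * 100 ** 3
```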

VC dim: sampling theorem
Realizable case:
- Sample bound: m ≥ (1/ε) log( 4(2m)^d / δ ) = (1/ε) log(4/δ) + (d/ε) log(2m)
- Solving the implicit bound for m: m = O( (1/ε) log(1/δ) + (d/ε) log(d/ε) )
Non-realizable case (same proof methodology):
- m = O( (1/ε²) log(1/δ) + (d/ε²) log(d/ε) )

Rademacher complexity
- Motivation: tighter, distribution-dependent bounds
- Notation: f : X → {-1,+1}, f ∈ F
- σ_i are i.i.d. random signs with Pr[σ_i = +1] = Pr[σ_i = -1] = 1/2

Rademacher complexity
Definition (Rademacher complexity): for a sample S = (x_1, …, x_m) of size m,
  R_S(F) = E_σ[ max_{f∈F} (1/m) Σ_{i=1}^m σ_i f(x_i) ]
  R_D(F) = E_S[ R_S(F) ]
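The empirical quantity R_S(F) can be estimated by a Monte Carlo average over the random signs σ. A sketch (the function name and the toy two-function class are mine):

```python
import random

def empirical_rademacher(F_values, trials=2000, seed=0):
    """Estimate E_sigma[ max_f (1/m) sum_i sigma_i f(x_i) ].
    F_values: one vector (f(x_1), ..., f(x_m)) per function f in F."""
    rng = random.Random(seed)
    m = len(F_values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice([-1, 1]) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(sigma, fv)) / m
                     for fv in F_values)
    return total / trials

# Toy class: the two constant functions f = +1 and f = -1 on m = 50 points.
# The max over {f, -f} equals |sum_i sigma_i| / m, about 1/sqrt(m) on average.
r = empirical_rademacher([[1] * 50, [-1] * 50])
assert 0.0 < r < 0.5
```

Larger, richer classes push the estimate up, matching the intuition that R_S(F) measures how well F can fit random noise on S.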

Rademacher complexity: expected overfitting
Theorem (expected overfitting):
  E_S[ max_{f∈F} ( (1/m) Σ_{i=1}^m f(x_i) - E_D[f(x)] ) ] ≤ 2 R_D(F)
Proof: two-sample trick. Add a fresh sample S' = (x'_1, …, x'_m), so that E_D[f(x)] = E_{S'}[(1/m) Σ_i f(x'_i)]:
  E_S[ max_f (1/m) Σ_i f(x_i) - E_{S'}[(1/m) Σ_i f(x'_i)] ]
  ≤ E_{S,S'}[ max_f (1/m) Σ_i ( f(x_i) - f(x'_i) ) ]

Rademacher complexity: expected overfitting
  = E_{S,S',σ}[ max_f (1/m) Σ_i σ_i ( f(x_i) - f(x'_i) ) ]   (swapping x_i and x'_i does not change the distribution, so random signs can be introduced)
  ≤ E_{S,σ}[ max_f (1/m) Σ_i σ_i f(x_i) ] + E_{S',σ}[ max_f (1/m) Σ_i (-σ_i) f(x'_i) ]
  = 2 R_D(F)   (since -σ_i has the same distribution as σ_i)
QED

Rademacher theorem
With probability 1-δ, for every h ∈ H:
  ε(h) ≤ ε̂(h) + R_D(H) + √( ln(2/δ) / (2m) )
       ≤ ε̂(h) + R_S(H) + 3√( ln(2/δ) / (2m) )

Model selection: outline
- Motivation
- Overfitting
- Structural Risk Minimization
- Hypothesis validation

Motivation
Problems:
- We have too few examples
- We have a very rich hypothesis class
- How can we find the best hypothesis?
Alternatively:
- Usually we choose the hypothesis class
- How rich a class do we want?
- How should we go about doing it?

Overfitting
- Concept class: intervals on a line
- Can classify any training set
- Zero training error: is this the only goal?!

Overfitting: intervals
- We can always get zero training error!
- But are we interested in zero training error?!

Overfitting: intervals
Training errors as a function of the number of intervals used:

  intervals: 0  1  2  3  4
  errors:    7  3  2  1  0

Overfitting
- A simple concept plus noise can look like a very complex concept
- Causes: an insufficient number of examples, plus noise (e.g., noise rate 1/3)

Model selection
[Figure: error vs. complexity. The training error decreases with complexity, the complexity penalty increases, and the generalization error is minimized at an intermediate complexity.]

Theoretical model
- Nested hypothesis classes: H_1 ⊆ H_2 ⊆ H_3 ⊆ … ⊆ H_i ⊆ …
- There is a target function c_t(x); the setting is non-realizable
- True errors: ε(h) = Pr[h ≠ c_t]; ε_i = inf_{h ∈ H_i} ε(h)
- ε(h*) = inf_i ε_i, where h* is the best hypothesis
- Training error: ε̂(h) = (1/m) Σ_{j=1}^m I[h(x_j) ≠ c_t(x_j)]; ε̂_i = inf_{h ∈ H_i} ε̂(h)

Theoretical model
- Complexity of h: d(h) = min{ i : h ∈ H_i }
- Penalty-based approach: add a penalty for d(h) and choose the hypothesis that minimizes ε̂(h) + penalty(h)

Structural Risk Minimization
- Parameters λ_i and δ_i such that: Pr[ ∃h ∈ H_i : |ε(h) - ε̂(h)| > λ_i ] ≤ δ_i
- Σ_i δ_i = δ, e.g., δ_i = δ/2^i
- This implies: with probability 1-δ, for every h ∈ H, |ε(h) - ε̂(h)| ≤ λ_{d(h)}

Structural Risk Minimization
Setting the penalty: penalty(h) = λ_{d(h)}
- Finite H_i: λ_i = √( log(|H_i|/δ_i) / m )
- VC-dim(H_i) = i: λ_i = √( (i log m + log(1/δ_i)) / m ) (up to constants)
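The SRM rule "minimize ε̂(h) + penalty(h)" can be sketched as follows, using the finite-class penalty with δ_i = δ/2^(i+1) so the δ_i sum to at most δ; all names and the example numbers are mine, not from the slides:

```python
import math

def srm_select(candidates, m, delta):
    """candidates: list of (train_error, class_size) pairs, one per H_i,
    where the i-th entry describes the ERM hypothesis found in H_i.
    Returns the index minimizing train_error + penalty."""
    def penalty(i, class_size):
        delta_i = delta / 2 ** (i + 1)   # delta_i = delta / 2^(i+1)
        return math.sqrt(math.log(class_size / delta_i) / m)
    scores = [err + penalty(i, size) for i, (err, size) in enumerate(candidates)]
    return min(range(len(scores)), key=scores.__getitem__)

# Richer classes fit the training set better but pay a larger penalty.
candidates = [(0.30, 10), (0.12, 10**3), (0.10, 10**6), (0.09, 10**12)]
best = srm_select(candidates, m=500, delta=0.05)
```

With these numbers the second class wins: its small improvement in training error over the first class outweighs its penalty, while the even richer classes do not improve enough to justify theirs.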

SRM: performance
Theorem: let h* be the best hypothesis and g_srm the SRM choice. With probability 1-δ:
  ε(h*) ≤ ε(g_srm) ≤ ε(h*) + 2·penalty(h*)
Note: the bound depends only on h*.

Proof
Bounding the error within H_{d(g_srm)}:
  Pr[ ε(g_srm) - ε̂(g_srm) > λ_{d(g_srm)} ] ≤ Pr[ ∃h ∈ H_{d(g_srm)} : ε(h) - ε̂(h) > λ_{d(g_srm)} ] ≤ δ_{d(g_srm)}
Bounding the error across the H_i (writing λ* = λ_{d(h*)} and λ_srm = λ_{d(g_srm)}):
  ε(g_srm) ≤ ε̂(g_srm) + λ_srm ≤ ε̂(h*) + λ* ≤ ε(h*) + 2λ*
where the middle step holds because g_srm minimizes ε̂(h) + penalty(h).
QED

Hypothesis validation
- Separate the sample into a training part and a selection part
- Using the training part, select from each H_i a candidate g_i
- Using the selection sample, select among g_1, …, g_m
- Split sizes: (1-γ)m for the training set, γm for the selection set

Hypothesis validation: algorithm
- Using (1-γ)m examples S_1: let ε̂_1(h) be the error on S_1; set g_i = argmin_{h ∈ H_i} ε̂_1(h)
- Using γm examples S_2: let ε̂_2(h) be the error on S_2; set g_HV = argmin_{g_i ∈ G} ε̂_2(g_i)
- Return g_HV
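The two-stage procedure can be sketched on a toy thresholds example (all names, the data generator, and the two classes are mine, not from the slides):

```python
import random

def hypothesis_validation(sample, hypothesis_classes, gamma=0.3):
    """Split the sample, run ERM per class on S1, pick the winner on S2."""
    k = int((1 - gamma) * len(sample))
    S1, S2 = sample[:k], sample[k:]
    def err(h, S):
        return sum(h(x) != y for x, y in S) / len(S)
    # ERM candidate g_i from each class H_i on the training part S1
    G = [min(H, key=lambda h: err(h, S1)) for H in hypothesis_classes]
    # final choice g_HV on the selection part S2
    return min(G, key=lambda g: err(g, S2))

# Toy data: threshold concept at 0.4 with 10% label noise.
rng = random.Random(1)
data = [(x, int(x >= 0.4) if rng.random() > 0.1 else 1 - int(x >= 0.4))
        for x in (rng.random() for _ in range(300))]
# Two nested classes of thresholds: a coarse grid and a fine grid.
H1 = [lambda x, t=t: int(x >= t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
H2 = [lambda x, t=t: int(x >= t) for t in [i / 50 for i in range(51)]]
g = hypothesis_validation(data, [H1, H2])
```

The returned g should behave like a threshold near 0.4, e.g. g(0.9) = 1 and g(0.05) = 0, despite the label noise.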

Hypo. validation: performance
Errors:
- ε_hv(m): error of HV using m examples
- ε_A(m): error of A, any algorithm, using m examples
- The only restriction on A: it selects g_i from H_i (for example, any penalty function)
Theorem: with probability 1-δ:
  ε_hv(m) ≤ ε_A((1-γ)m) + 2√( ln(2m/δ) / (γm) )

Hypo. validation: analysis
- Pr[ |ε(g_i) - ε̂_2(g_i)| > λ ] ≤ 2e^{-λ²γm}
- Union bound: Pr[ ∃i : |ε(g_i) - ε̂_2(g_i)| > λ ] ≤ 2|G| e^{-λ²γm} = δ
- Since |G| ≤ m: λ = √( ln(2m/δ) / (γm) )
- Chain, for every i: ε(g_HV) ≤ ε̂_2(g_HV) + λ ≤ ε̂_2(g_i) + λ ≤ ε(g_i) + 2λ

Summary
- PAC model
- Generalization bounds
- Empirical Risk Minimization
- VC dimension
- Rademacher complexity
- Model selection: Structural Risk Minimization (SRM), hypothesis validation